Loading User Supplied PubChem BioAssay Data
This section demonstrates the process for creating a new bioactivity database from user supplied data. As an example, we will demonstrate the process of downloading an assay from the NCBI PubChem BioAssay bioactivity data repository, and loading this into a new database (Wang, et al., 2012).
First, get two files from PubChem BioAssay for the assay of interest: an XML file containing details on how the experiment was performed, and a CSV (comma separated value) file which contains the actual activity scores. For the purposes of this example, we will use the data from assay 1000, which is a confirmatory assay (titration assay) of 57 small molecules against a mevalonate kinase protein. More details on this assay were provided in the “Quick Tutorial,” where the same data is used. These files can be downloaded from PubChem BioAssay at http://pubchem.ncbi.nlm.nih.gov/ or loaded from the example data repository included in this package as follows:
Next, create a new empty database for loading these data into. This example uses the R tempfile function to create the database in a random location. If you would like to keep your resulting database, replace myDatabaseFilename with your desired path and filename.
We will also add a data source to this database, specifying that our data here mirrors an assay provided by PubChem BioAssay.
The XML file provided by PubChem BioAssay contains extensive details on how the assay was performed, molecular targets, and results scoring methods. You can extract these using the parsePubChemBioassay function as follows.
The parsePubChemBioassay function also requires a .csv file which contains the activity scores for each assay, in the standard CSV format provided by PubChem BioAssay.
For data from sources other than PubChem BioAssay, you may need to write your own code to parse out the assay details- or type them in manually.
Next, load the resulting data parsed from the XML and CSV files into the database. This creates records in the database for both the assay itself, and it’s molecular targets.
To load additional assays, repeat the above steps. After all data is loaded, you can significantly improve subsequent query performance by adding an index to the database.
After indexing, perform a test query on your database to confirm that the data loaded correctly.
Lastly, disconnect from the database to prevent data corruption.
Prebuilt Database Example: Investigate Activity of a Known Drug
A pre-built database containing large quantities of public domain bioactivity data sourced from the PubChem BioAssay database, can be downloaded from http://chemmine.ucr.edu/bioassayr.
While downloading the full database is recommended, it is possible to run this example using a small subset of the database, included within the bioassayR package for testing purposes.
This example demonstrates the utility of bioassayR for identifying the bioactivity patterns of a small drug-like molecule. In this example, we look at the binding activity patterns for the drug acetylsalicylic acid (aka Aspirin) and compare these binding data to annotated targets in the DrugBank drug database (Wishart, et al., 2008).
The DrugBank database is a valuable resource containing numerous data on drug activity in humans, including known molecular targets. In this exercise, first take a look at the
annotated molecular targets for acetylsalicylic acid by searching this name at http://drugbank.ca. This will provide a point of reference for comparing to the bioactivity
data we find in the prebuild PubChem BioAssay database. Note that DrugBank also contains the PubChem CID of this compound, which you can use to query the bioassayR PubChem BioAssay
To get started first connect to the database. The variable sampleDatabasePath can be replaced with the filename of the full test database you downloaded, if you would like to use
that instead of the small example version included with this software package.
Next, use the activeTargets function to find all protein targets which acetylsalicylic acid shows activity against in the loaded database.
These target IDs are NCBI Protein identifiers as provided by PubChem BioAssay (Tatusova, et al., 2014). In which cases do these results agree with or disagree
with the annotated targets from DrugBank?
Now we would like to connect to the UniProt (Universal Protein Resource) database
to obtain annotation details on these targets
(Bateman, et al., 2015). The biomaRt Bioconductor package
is an excellent tool for this purpose, but works best with UniProt identifers,
instead of the NCBI Protein identifiers we currently have, so we must
translate the identifiers first (Durinck, et al., 2009; Durinck, et al., 2005).
We will use the translateTargetId function in bioassayR to obtain
a corresponding UniProt identifier for each NCBI Protein identifier (GI).
These identifier translations were obtained from UniProt and come pre-loaded
into the database. The translateTargetId takes only a single query and returns
one or more UniProt identifiers. Here we call it with sapply which automates
calling the function multiple times, one for each protein stored in drugTargets.
In many instances, a single query GI will translate into multiple UniProt
identifiers. In this case, we keep only the first one as the annotation
details we are looking for here will likely be the same for all of them.
Next, we connect to Ensembl via biomaRt to obtain a description for each
target that is a Homo sapiens gene. For more information,
consult the biomaRt documentation. After retrieving these data, we call the
match function to ensure they are in the same order as the query data.
Now we can view this annotation data. NAs represent proteins not found
on the Homo sapiens Ensembl, which may be from other species.
Lastly, let’s again look at our active target list, with the annotation alongside.
Note, these only match up in length and order because in the above code
we removed all but one UniProt ID for each target protein, and then reordered
the biomaRt results with the match function to get them in the correct order.
Identify Target Selective Compounds
In the previous example, acetylsalicylic acid was found to show binding activity against numerous proteins, including the COX-1 cyclooxygenase enzyme (NCBI Protein ID 166897622).
COX-1 activity is theorized to be the key mechanism in this molecules function as a nonsteroidal anti-inflammatory drug (NSAID).
In this example, we will look for other small molecules which selectively bind to COX-1, under the assumption that these may be worth further investigation as potential nonsteroidal anti-inflammatory drugs.
This example shows how bioassayR can be used
identify small molecules which selectively bind to a target of interest, and assist in the
discovery of small molecule drugs and molecular probes.
First, we will start by connecting to a database.
Once again, the variable sampleDatabasePath can be replaced with the filename of the full PubChem BioAssay database (downloadable from http://chemmine.ucr.edu/bioassayr), if you would like to use
that instead of the small example version included with this software package.
The activeAgainst function can be used to show all small molecules in the database which demonstrate activity against COX-1 as follows.
Each row name represents a small molecule cid. The column labeled “total assays” shows the total number of times each small molecule has been screened against the target of interest. The column labeled “fraction active” shows the portion of these which were annotated as active as a number between 0 and 1.
This function allows users to consider different binding assays from distinct sources as replicates, to assist in distinguishing potentially spurious binding results from those with demonstrated reproducibility.
Looking only at compounds which show binding to the target of interest is not sufficient for identifying drug candidates, as a portion of these compounds may be target unselective compounds (TUCs) which bind
indiscriminately to a large number of distinct protein targets. The R function selectiveAgainst provides the user with a list of compounds that show activity against a target of interest (in at least one assay), while also showing limited activity against other targets.
The maxCompounds option limits the maximum number of results returned, and the minimumTargets
option limits returned compounds to those screened against a specified minimum of distinct targets. Results are formatted as a data.frame whereby each row name represents a distinct compound. The first column shows the number of distinct targets this compound shows activity against, and the second shows the total number of targets it was screened against.
In the example database these compounds are only showing one tested target because very few assays are loaded. Users are encouraged to try this example for themselves with the full PubChem BioAssay database downloadable from http://chemmine.ucr.edu/bioassayr for a more interesting and informative result.
Users can combine bioassayR with the ChemmineR library to obtain structural information on these target selective compounds, and then perform further analysis- such as structural clustering, visualization, and computing physicochemical properties.
The ChemmineR software library can be used to download structural data for any of these compounds, and to visualize these structures as follows (Cao, et al., 2008).
This example requires an active internet connection, as the compound structures are obtained from a remote server.
Here we visualize just the first four compounds found with selectiveAgainst. Consult the vignette supplied with ChemmineR for numerous examples of visualizing and analyzing these structures further.
Cluster Compounds by Activity Profile
This example demonstrates an example of clustering small molecules by similar bioactivity profiles across several distinct bioassay experiments.
In many cases it is too cpu and memory intensive to cluster all compounds in the database, so we first pull just a subset of these data from the database into an bioassaySet object, and then convert that into a compounds vs targets activity matrix for subsequent clustering according to similarities in activity profile.
The function getBioassaySetByCids extracts the activity data for a given list of compounds. Alternatively, the entire data for a given list of assay ids can be extracted with the function getAssays.
First, connect to the included sample database:
Next, select data from just 3 compounds to extract into a bioassaySet object for subsequent analysis.
The function perTargetMatrix converts the activity data extracted earlier into a matrix of targets (rows) vs compounds (columns). Data from multiple assays hitting the same target are summarized into a single value in the manner specified by the user. If you choose the summarizeReplicates option
activesFirst, any active scores take prescendence over inactives. If you choose the option mode the most abundant of either actives or inactives
is stored in the resulting matrix. You can also pass a custom function to decide how replicates are summarized.
In the resulting matrix a “2” represents an active compound vs target combination, a “1” represents an inactive combination, and a “0” represents
an untested or inconclusive combination. Inactive results can optionally be excluded from consideration
by passing the option inactives = FALSE. Here a sparse matrix (which omits actually storing 0 values) is used to save memory.
In a sparse matrix a period “.” entry is equal to a zero “0” entry, but is implied without
taking extra space in memory.
Here we create an activity matrix, choosing to include inactive values, and summarize replicates according to the statistical mode:
Next, we will re-create the activity matrix where protein targets
that are very similar at the sequence level (such as orthologues from different species)
are treated as replicates, and merged.
The function perTargetMatrix contains an option assayTargets
which let’s you specify the targets for each assay instead of taking them from the bioassaySet object.
The function assaySetTargets returns a vector of the targets for each assay
in a bioassaySet object, where the name of each element corresponds to it’s
assay identifier (aid). This is the format that must be passed to perTargetMatrix
to specify which assays are treated as replicates to be merged, so first we can
obtain these data, and then replace them with a custom merge criteria formatted
in the same manner.
The pre-built PubChem BioAssay database includes sequence level
protein target clustering results generated with the kClust tool
(options s=2.93, E-value < 1*10^-4, c=0.8) (Hauser, et al., 2013).
Each cluster has a unique number, and targets which cluster together are assigned
the same cluster number.
These clustering results are stored in the database as target
translations under category “kClust”. Now we will access these
traslations, and make a compound vs. target cluster matrix as follows.
Note that the merged matrix is smaller, because several similar protein targets
have been collapsed into single clusters.
Now, to make it possible to use binary clustering methods, we simplify the matrix
into a binary matrix where “1” represents active, and “0” represents either inactive
or untested combinations. We caution the user to carefully consider if this makes sense within the context of the specific experiments being analyzed.
As mentioned earlier, “0” and “.” entries in a sparse matrix are numerically equivalent.
Cluster using the built in euclidean clustering functions within R to cluster.
This provides a dendrogram which indicates the similarity amongst compounds according to their activity profiles.
A second way to compare activity profiles and cluster data is to pass
the activity matrix to the ChemmineR cheminformatics library as an FPset (binary fingerprint) object.
This represents the bioactivity data as a binary fingerprint (bioactivity fingerprint), which is a
binary string for each compound, where each bit represents activity (with a 1) or inactivity (with a 0) against
the full set of targets these compounds have shown activity against. This allows for pairwise comparison
of the bioactivity profile among compounds.
See the ChemmineR documentation at http://bioconductor.org/packages/ChemmineR/ for additional examples and details.
Perform activity profile similarity searching with the FPset object, by comparing
the first compound to all compounds.
Compute an all-against-all bioactivity fingerprint similarity matrix for these compounds.
Convert similarity matrix to a distance matrix and perform hierarcheal clustering.
Finally, disconnect from the database.
Analyze and Load Raw Screening Data
This example demonstrates the basics of analyzing and loading data
from a high throughput screening experiment with scores for 21,888
This example is based on the cellHTS2 library
(Boutros, et al., 2006). Example data
is used which is included with cellHTS2. This is actually data from
screening dsRNA against live cells, however we will treat it as small molecule
binding data against a protein target as the data format is the same.
First read in the screening data provided with cellHTS2.
Next, score and summarize the replicates.
Parse the annotation data.
Apply a sigmoidal transformation to generate binary calls.
Convert the binary calls into an activity table that bioassayR can parse.
Create a new (temporary in this case) bioassayR database to load these data into.
Create an assay object for the new assay.
Load this assay object into the bioassayR database.
Now that these data are loaded, you can use them to perform any of the other analysis examples in this document.
Lastly, for the purposes of this example, disconnect from the example database.
Custom SQL Queries
While many pre-built queries are provided (see other examples and man pages)
advanced users can also build their own SQL queries.
As bioassayR uses a SQLite database, you can consult http://www.sqlite.org for specifics on building SQL queries.
We also reccomend the book “Using SQLite” (Kreibich, 2010).
To get started first connect to a database. If you downloaded the full PubChem BioAssay
database from the authors website, the variable sampleDatabasePath can be replaced with the filename of the database you downloaded, if you would like to use
that instead of the small example version included with this software package.
First you will want to see the structure
of the database as follows:
For example, you can find the first 10 assays a given compound has participated in as follows:
This example prints the activity scores from a specified assay:
A NATURAL JOIN automatically merges tables which share common rows, making
it easier to parse data spread across many tables. Here we merge the activity
table (raw scores), with the assay table (assay annotation details) and the
protein targets table:
Lastly, disconnecting from the database after analysis reduces the chances of data corruption. If you are using a pre-built database in read only mode (as demonstrated in the Prebuilt Database Example section) you can optionally skip this step, as only writable databases are prone to corruption from failure to disconnect.