## Clustering Identical or Very Similar Compounds

Often it is of interest to identify very similar or identical compounds
in a compound set. The `cmp.duplicated`

function can be
used to quickly identify very similar compounds in atom pair sets, which
will be frequently, but not necessarily, identical compounds.

Identify compounds with identical AP sets:

Plot the structure of two pairs of duplicates:

Remove AP duplicates from SDFset and APset objects:

Alternatively, one can identify duplicates via other descriptor types if they are provided in the data block of an imported SD file. For instance, one can use here fingerprints, InChI, SMILES or other molecular representations. The following examples show how to enumerate by identical InChI strings, SMILES strings and molecular formula, respectively.

## Binning Clustering

Compound libraries can be clustered into discrete similarity groups with
the binning clustering function `cmp.cluster`

. The
function accepts as input an atom pair (`APset`

) or a
fingerprint (`FPset`

) descriptor database as well as a
similarity threshold. The binning clustering result is returned in form
of a data frame. Single linkage is used for cluster joining. The
function calculates the required compound-to-compound distance
information on the fly, while a memory-intensive distance matrix is only
created upon user request via the `save.distances`

argument (see below).

Because an optimum similarity threshold is often not known, the
`cmp.cluster`

function can calculate cluster results for
multiple cutoffs in one step with almost the same speed as for a single
cutoff. This can be achieved by providing several cutoffs under the
cutoff argument. The clustering results for the different cutoffs will
be stored in one data frame.

One may force the `cmp.cluster`

function to calculate and
store the distance matrix by supplying a file name to the
`save.distances`

argument. The generated distance matrix
can be loaded and passed on to many other clustering methods available
in R, such as the hierarchical clustering function
`hclust`

(see below).

If a distance matrix is available, it may also be supplied to
`cmp.cluster`

via the `use.distances`

argument. This is useful when one has a pre-computed distance matrix
either from a previous call to `cmp.cluster`

or from
other distance calculation subroutines.

Single-linkage binning clustering with one or multiple cutoffs:

Clustering of `FPset`

objects with multiple cutoffs. This
method allows to call various similarity methods provided by the
`fpSim`

function. For details consult
`?fpSim`

.

Sames as above, but using Tversky similarity measure:

Return cluster size distributions for each cutoff:

Enforce calculation of distance matrix:

## Jarvis-Patrick Clustering

The Jarvis-Patrick clustering algorithm is widely used in
cheminformatics (R.A. Jarvis , 1973). It requires a nearest neighbor table, which consists
of *j* nearest neighbors for each item (*e.g.* compound).
The nearest neighbor table is then used to join items into clusters when
they meet the following requirements: (a) they are contained in each
other’s neighbor list and (b) they share at least *k*
nearest neighbors. The values for *j* and
*k* are user-defined parameters. The
`jarvisPatrick`

function implemented in
`ChemmineR`

takes a nearest neighbor table generated by
`nearestNeighbors`

, which works for
`APset`

and `FPset`

objects. This function
takes either the standard Jarvis-Patrick *j* parameter
(as the `numNbrs`

parameter), or else a
`cutoff`

value, which is an extension to the basic
algorithm that we have added. Given a cutoff value, the nearest neighbor
table returned contains every neighbor with a similarity greater than
the cutoff value, for each item. This allows one to generate tighter
clusters and to minimize certain limitations of this method, such as
false joins of completely unrelated items when operating on small data
sets. The `trimNeighbors`

function can also be used to
take an existing nearest neighbor table and remove all neighbors whose
similarity value is below a given cutoff value. This allows one to
compute a very relaxed nearest neighbor table initially, and then
quickly try different refinements later.

In case an existing nearest neighbor matrix needs to be used, the
`fromNNMatrix`

function can be used to transform it into
the list structure that `jarvisPatrick`

requires. The
input matrix must have a row for each compound, and each row should be
the index values of the neighbors of compound represented by that row.
The names of each compound can also be given through the
`names`

argument. If not given, it will attempt to use
the `rownames`

of the given matrix.

The `jarvisPatrick`

function also allows one to relax
some of the requirements of the algorithm through the
`mode`

parameter. When set to “a1a2b”, then all
requirements are used. If set to “a1b”, then (a) is relaxed to a
unidirectional requirement. Lastly, if `mode`

is set to
“b”, then only requirement (b) is used, which means that all pairs of
items will be checked to see if (b) is satisfied between them. The size
of the clusters generated by the different methods increases in this
order: “a1a2b” < “a1b” < “b”. The run time of method “a1a2b” follows a
close to linear relationship, while it is nearly quadratic for the much
more exhaustive method “b”. Only methods “a1a2b” and “a1b” are suitable
for clustering very large data sets (e.g. >50,000 items) in a
reasonable amount of time.

An additional extension to the algorithm is the ability to set the
linkage mode. The `linkage`

parameter can be one of
“single”, “average”, or “complete”, for single linkage, average linkage
and complete linkage merge requirements, respectively. In the context of
Jarvis-Patrick, average linkage means that at least half of the pairs
between the clusters under consideration must meet requirement (b).
Similarly, for complete linkage, all pairs must requirement (b). Single
linkage is the normal case for Jarvis-Patrick and just means that at
least one pair must meet requirement (b).

The output is a cluster `vector`

with the item labels in
the name slot and the cluster IDs in the data slot. There is a utility
function called `byCluster`

, which takes out cluster
vector output by `jarvisPatrick`

and transforms it into a
list of vectors. Each slot of the list is named with a cluster id and
the vector contains the cluster members. By default the function
excludes singletons from the output, but they can be included by setting
`excludeSingletons`

=FALSE`.

Load/create sample `APset`

and `FPset`

:

Standard Jarvis-Patrick clustering on `APset`

and
`FPset`

objects:

The following example runs Jarvis-Patrick clustering with a minimum
similarity `cutoff`

value (here Tanimoto coefficient). In
addition, it uses the much more exhaustive `"b"`

method
that generates larger cluster sizes, but significantly increased the run
time. For more details, consult the corresponding help file with
`?jarvisPatrick`

.

Output nearest neighbor table (`matrix`

):

Trim nearest neighbor table:

Perform clustering on precomputed nearest neighbor table:

Using a user defined nearest neighbor matrix:

## Multi-Dimensional Scaling (MDS)

To visualize and compare clustering results, the
`cluster.visualize`

function can be used. The function
performs Multi-Dimensional Scaling (MDS) and visualizes the results in
form of a scatter plot. It requires as input an `APset`

,
a clustering result from `cmp.cluster`

, and a cutoff for
the minimum cluster size to consider in the plot. To help determining a
proper cutoff size, the `cluster.sizestat`

function is
provided to generate cluster size statistics.

MDS clustering and scatter plot:

Create a 3D scatter plot of MDS result:

Interactive 3D scatter plot with Open GL (graphics not evaluated here):

## Clustering with Other Algorithms

`ChemmineR`

allows the user to take advantage of the wide
spectrum of clustering utilities available in R. An example on how to
perform hierarchical clustering with the hclust function is given
below.

Create atom pair distance matrix:

Hierarchical clustering with `hclust`

:

Instead of atom pairs one can use PubChem’s fingerprints for clustering:

Plot dendrogram with heatmap (here similarity matrix):