Clustering Identical or Very Similar Compounds
Often it is of interest to identify very similar or identical compounds
in a compound set. The cmp.duplicated function can be used to quickly
identify compounds with identical atom pair sets; these are frequently,
but not necessarily, identical compounds.
Identify compounds with identical AP sets:
cmp.duplicated(apset, type=1)[1:4] # Returns AP duplicates as logical vector
## [1] FALSE FALSE FALSE FALSE
cmp.duplicated(apset, type=2)[1:4,] # Returns AP duplicates as data frame
## ids CLSZ_100 CLID_100
## 1 650082 1 1
## 2 650059 2 2
## 3 650060 2 2
## 4 650010 1 3
Plot the structure of two pairs of duplicates:
plot(sdfset[c("650059","650060", "650065", "650066")], print=FALSE)

Remove AP duplicates from SDFset and APset objects:
apdups <- cmp.duplicated(apset, type=1)
sdfset[which(!apdups)]; apset[which(!apdups)]
## An instance of "SDFset" with 96 molecules
## An instance of "APset" with 96 molecules
Alternatively, one can identify duplicates via other descriptor types if
they are provided in the data block of an imported SD file. For
instance, one can use here fingerprints, InChI, SMILES or other
molecular representations. The following examples show how to enumerate
by identical InChI strings, SMILES strings and molecular formula,
respectively.
count <- table(datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI"))
count <- table(datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES"))
count <- table(datablocktag(sdfset, tag="PUBCHEM_MOLECULAR_FORMULA"))
count[1:4]
##
## C10H9FN2O2S C11H12N4OS C11H13NO4 C12H11ClN2OS
## 1 1 1 1
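A count table like this can also be used to list the duplicated entries
directly. A minimal sketch, assuming the count object computed in the
previous step:
count[count > 1] # Sketch: descriptor values (here molecular formulas) occurring more than once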
Binning Clustering
Compound libraries can be clustered into discrete similarity groups with
the binning clustering function cmp.cluster. The function accepts as
input an atom pair (APset) or a fingerprint (FPset) descriptor database
as well as a similarity threshold. The binning clustering result is
returned in the form of a data frame. Single linkage is used for cluster
joining. The function calculates the required compound-to-compound
distance information on the fly, while a memory-intensive distance
matrix is only created upon user request via the save.distances argument
(see below).
Because an optimum similarity threshold is often not known, the
cmp.cluster function can calculate cluster results for multiple cutoffs
in one step with almost the same speed as for a single cutoff. This can
be achieved by providing several cutoffs under the cutoff argument. The
clustering results for the different cutoffs will be stored in one data
frame.
One may force the cmp.cluster function to calculate and store the
distance matrix by supplying a file name to the save.distances argument.
The generated distance matrix can be loaded and passed on to many other
clustering methods available in R, such as the hierarchical clustering
function hclust (see below).
If a distance matrix is available, it may also be supplied to
cmp.cluster via the use.distances argument. This is useful when one has
a pre-computed distance matrix either from a previous call to
cmp.cluster or from other distance calculation subroutines.
Single-linkage binning clustering with one or multiple cutoffs:
clusters <- cmp.cluster(db=apset, cutoff = c(0.7, 0.8, 0.9), quiet = TRUE)
##
## sorting result...
clusters[1:12,]
## ids CLSZ_0.7 CLID_0.7 CLSZ_0.8 CLID_0.8 CLSZ_0.9 CLID_0.9
## 48 650049 2 48 2 48 2 48
## 49 650050 2 48 2 48 2 48
## 54 650059 2 54 2 54 2 54
## 55 650060 2 54 2 54 2 54
## 56 650061 2 56 2 56 2 56
## 57 650062 2 56 2 56 2 56
## 58 650063 2 58 2 58 2 58
## 59 650064 2 58 2 58 2 58
## 60 650065 2 60 2 60 2 60
## 61 650066 2 60 2 60 2 60
## 1 650001 1 1 1 1 1 1
## 2 650002 1 2 1 2 1 2
Clustering of FPset objects with multiple cutoffs. This approach allows
one to call the various similarity methods provided by the fpSim
function. For details consult ?fpSim.
fpset <- desc2fp(apset)
clusters2 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9), method="Tanimoto", quiet=TRUE)
##
## sorting result...
clusters2[1:12,]
## ids CLSZ_0.5 CLID_0.5 CLSZ_0.7 CLID_0.7 CLSZ_0.9 CLID_0.9
## 69 650074 14 11 2 69 1 69
## 79 650085 14 11 2 69 1 79
## 11 650011 14 11 1 11 1 11
## 15 650015 14 11 1 15 1 15
## 45 650046 14 11 1 45 1 45
## 47 650048 14 11 1 47 1 47
## 51 650054 14 11 1 51 1 51
## 53 650058 14 11 1 53 1 53
## 64 650069 14 11 1 64 1 64
## 65 650070 14 11 1 65 1 65
## 67 650072 14 11 1 67 1 67
## 86 650092 14 11 1 86 1 86
Same as above, but using the Tversky similarity measure:
clusters3 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9),
method="Tversky", alpha=0.3, beta=0.7, quiet=TRUE)
##
## sorting result...
Return cluster size distributions for each cutoff:
cluster.sizestat(clusters, cluster.result=1)
## cluster size count
## 1 1 90
## 2 2 5
cluster.sizestat(clusters, cluster.result=2)
## cluster size count
## 1 1 90
## 2 2 5
cluster.sizestat(clusters, cluster.result=3)
## cluster size count
## 1 1 90
## 2 2 5
Enforce calculation of distance matrix:
clusters <- cmp.cluster(db=apset, cutoff = c(0.65, 0.5, 0.3),
save.distances="distmat.rda") # Saves distance matrix to file "distmat.rda" in current working directory.
load("distmat.rda") # Loads distance matrix.
Jarvis-Patrick Clustering
The Jarvis-Patrick clustering algorithm is widely used in
cheminformatics (Jarvis and Patrick, 1973). It requires a nearest
neighbor table, which consists of the j nearest neighbors of each item
(e.g. compound). The nearest neighbor table is then used to join items
into clusters when they meet the following requirements: (a) they are
contained in each other’s neighbor list and (b) they share at least k
nearest neighbors. The values for j and k are user-defined parameters.
The jarvisPatrick function implemented in ChemmineR takes a nearest
neighbor table generated by nearestNeighbors, which works for APset and
FPset objects. This function takes either the standard Jarvis-Patrick j
parameter (as the numNbrs parameter), or else a cutoff value, which is
an extension to the basic algorithm that we have added. Given a cutoff
value, the nearest neighbor table returned contains, for each item,
every neighbor with a similarity greater than the cutoff value. This
allows one to generate tighter clusters and to minimize certain
limitations of this method, such as false joins of completely unrelated
items when operating on small data sets. The trimNeighbors function can
also be used to take an existing nearest neighbor table and remove all
neighbors whose similarity value is below a given cutoff value. This
allows one to compute a very relaxed nearest neighbor table initially,
and then quickly try different refinements later.
In case an existing nearest neighbor matrix needs to be used, the
fromNNMatrix function can transform it into the list structure that
jarvisPatrick requires. The input matrix must have a row for each
compound, and each row should contain the index values of the neighbors
of the compound represented by that row. The names of the compounds can
also be given through the names argument. If not given, the function
will attempt to use the rownames of the given matrix.
The jarvisPatrick function also allows one to relax some of the
requirements of the algorithm through the mode parameter. When set to
“a1a2b”, all requirements are used. If set to “a1b”, then (a) is relaxed
to a unidirectional requirement. Lastly, if mode is set to “b”, then
only requirement (b) is used, which means that all pairs of items will
be checked to see if (b) is satisfied between them. The size of the
clusters generated by the different methods increases in this order:
“a1a2b” < “a1b” < “b”. The run time of method “a1a2b” scales close to
linearly with the number of items, while it is nearly quadratic for the
much more exhaustive method “b”. Only methods “a1a2b” and “a1b” are
suitable for clustering very large data sets (e.g. >50,000 items) in a
reasonable amount of time.
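As an illustration of the intermediate setting, the relaxed “a1b” mode
can be applied to the same kind of nearest neighbor table used in the
examples further below; a minimal sketch, assuming the apset loaded
there:
clJP <- jarvisPatrick(nearestNeighbors(apset, numNbrs=6), k=5, mode="a1b") # Sketch: requirement (a) only needs to hold in one direction.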
An additional extension to the algorithm is the ability to set the
linkage mode. The linkage parameter can be one of “single”, “average”,
or “complete”, for single linkage, average linkage and complete linkage
merge requirements, respectively. In the context of Jarvis-Patrick,
average linkage means that at least half of the pairs between the
clusters under consideration must meet requirement (b). Similarly, for
complete linkage, all pairs must meet requirement (b). Single linkage is
the normal case for Jarvis-Patrick and simply means that at least one
pair must meet requirement (b).
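A minimal sketch of the linkage extension, here assuming average linkage
and the fpset object created below:
clAvg <- jarvisPatrick(nearestNeighbors(fpset, numNbrs=6), k=5,
mode="a1a2b", linkage="average") # Sketch: at least half of the pairs between candidate clusters must meet (b).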
The output is a cluster vector with the item labels in the name slot and
the cluster IDs in the data slot. There is a utility function called
byCluster, which takes the cluster vector output by jarvisPatrick and
transforms it into a list of vectors. Each slot of the list is named
with a cluster id and the vector contains the cluster members. By
default the function excludes singletons from the output, but they can
be included by setting excludeSingletons=FALSE (see the example further
below).
Load/create sample APset and FPset:
data(apset)
fpset <- desc2fp(apset)
Standard Jarvis-Patrick clustering on APset and FPset objects:
jarvisPatrick(nearestNeighbors(apset,numNbrs=6), k=5, mode="a1a2b")
## 650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011 650012 650013 650014
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 650015 650016 650017 650019 650020 650021 650022 650023 650024 650025 650026 650027 650028 650029
## 11 15 16 17 18 19 20 21 22 23 24 25 26 27
## 650030 650031 650032 650033 650034 650035 650036 650037 650038 650039 650040 650041 650042 650043
## 28 29 30 31 32 33 34 35 36 37 38 39 40 41
## 650044 650045 650046 650047 650048 650049 650050 650052 650054 650056 650058 650059 650060 650061
## 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 650062 650063 650064 650065 650066 650067 650068 650069 650070 650071 650072 650073 650074 650075
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## 650076 650077 650078 650079 650080 650081 650082 650083 650085 650086 650087 650088 650089 650090
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83
## 650091 650092 650093 650094 650095 650096 650097 650098 650099 650100 650101 650102 650103 650104
## 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## 650105 650106
## 98 99
#Using "APset"
jarvisPatrick(nearestNeighbors(fpset,numNbrs=6), k=5, mode="a1a2b")
## 650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011 650012 650013 650014
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 650015 650016 650017 650019 650020 650021 650022 650023 650024 650025 650026 650027 650028 650029
## 11 15 16 17 18 19 20 21 22 23 24 25 26 27
## 650030 650031 650032 650033 650034 650035 650036 650037 650038 650039 650040 650041 650042 650043
## 28 29 30 31 32 33 34 35 36 37 38 39 40 41
## 650044 650045 650046 650047 650048 650049 650050 650052 650054 650056 650058 650059 650060 650061
## 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 650062 650063 650064 650065 650066 650067 650068 650069 650070 650071 650072 650073 650074 650075
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## 650076 650077 650078 650079 650080 650081 650082 650083 650085 650086 650087 650088 650089 650090
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83
## 650091 650092 650093 650094 650095 650096 650097 650098 650099 650100 650101 650102 650103 650104
## 84 85 86 87 88 89 90 91 92 93 94 1 95 96
## 650105 650106
## 97 98
#Using "FPset"
The following example runs Jarvis-Patrick clustering with a minimum
similarity cutoff value (here the Tanimoto coefficient). In addition, it
uses the much more exhaustive "b" method, which generates larger
clusters but significantly increases the run time. For more details,
consult the corresponding help file with ?jarvisPatrick.
cl <- jarvisPatrick(nearestNeighbors(fpset, cutoff=0.6,
method="Tanimoto"), k=2, mode="b")
byCluster(cl)
## $`11`
## [1] "650011" "650092"
##
## $`15`
## [1] "650015" "650069"
##
## $`45`
## [1] "650046" "650054"
##
## $`48`
## [1] "650049" "650050"
##
## $`52`
## [1] "650059" "650060"
##
## $`53`
## [1] "650061" "650062"
##
## $`54`
## [1] "650063" "650064"
##
## $`55`
## [1] "650065" "650066"
##
## $`62`
## [1] "650074" "650085"
Output nearest neighbor table (matrix):
nnm <- nearestNeighbors(fpset,numNbrs=6)
nnm$names[1:4]
## [1] "650001" "650002" "650003" "650004"
nnm$ids[1:4,]
## NULL
nnm$similarities[1:4,]
## 650001 650102 650072 650015 650094 650069
## sim 1 0.4719101 0.4288499 0.4275229 0.4247423 0.4187380
## sim 1 0.4343891 0.4246575 0.4216867 0.3939394 0.3922078
## sim 1 0.4152249 0.3619303 0.3610315 0.3424242 0.3367089
## sim 1 0.5791045 0.4973958 0.4192708 0.4166667 0.4104683
Trim nearest neighbor table:
nnm <- trimNeighbors(nnm,cutoff=0.4)
nnm$similarities[1:4,]
## 650001 650102 650072 650015 650094 650069
## sim 1 0.4719101 0.4288499 0.4275229 0.4247423 0.4187380
## sim 1 0.4343891 0.4246575 0.4216867 NA NA
## sim 1 0.4152249 NA NA NA NA
## sim 1 0.5791045 0.4973958 0.4192708 0.4166667 0.4104683
Perform clustering on precomputed nearest neighbor table:
jarvisPatrick(nnm, k=5,mode="b")
## 650001 650002 650003 650004 650005 650006 650007 650008 650009 650010 650011 650012 650013 650014
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 650015 650016 650017 650019 650020 650021 650022 650023 650024 650025 650026 650027 650028 650029
## 11 15 16 17 18 19 20 21 22 23 24 25 26 27
## 650030 650031 650032 650033 650034 650035 650036 650037 650038 650039 650040 650041 650042 650043
## 28 29 30 31 32 33 34 35 36 37 38 39 40 41
## 650044 650045 650046 650047 650048 650049 650050 650052 650054 650056 650058 650059 650060 650061
## 42 43 11 44 11 45 46 47 48 49 50 51 52 53
## 650062 650063 650064 650065 650066 650067 650068 650069 650070 650071 650072 650073 650074 650075
## 54 55 56 57 57 58 59 11 60 61 62 63 64 65
## 650076 650077 650078 650079 650080 650081 650082 650083 650085 650086 650087 650088 650089 650090
## 66 67 68 69 37 70 71 72 64 73 74 75 76 77
## 650091 650092 650093 650094 650095 650096 650097 650098 650099 650100 650101 650102 650103 650104
## 78 11 79 80 81 82 83 84 85 86 87 1 88 89
## 650105 650106
## 90 91
Using a user defined nearest neighbor matrix:
nn <- matrix(c(1,2,2,1),2,2,dimnames=list(c('one','two')))
nn
## [,1] [,2]
## one 1 2
## two 2 1
byCluster(jarvisPatrick(fromNNMatrix(nn),k=1))
## $`1`
## [1] "one" "two"
Multi-Dimensional Scaling (MDS)
To visualize and compare clustering results, the cluster.visualize
function can be used. The function performs Multi-Dimensional Scaling
(MDS) and visualizes the results in the form of a scatter plot. It
requires as input an APset, a clustering result from cmp.cluster, and a
cutoff for the minimum cluster size to consider in the plot. To help
determine a proper cutoff size, the cluster.sizestat function is
provided to generate cluster size statistics.
MDS clustering and scatter plot:
cluster.visualize(apset, clusters, size.cutoff=2, quiet = TRUE) # Color codes clusters with at least two members.
cluster.visualize(apset, clusters, quiet = TRUE) # Plots all items.
Create a 3D scatter plot of MDS result:
library(scatterplot3d)
coord <- cluster.visualize(apset, clusters, size.cutoff=1, dimensions=3, quiet=TRUE)
scatterplot3d(coord)

Interactive 3D scatter plot with OpenGL (graphics not evaluated here):
library(rgl); rgl.open(); offset <- 50;
par3d(windowRect=c(offset, offset, 640+offset, 640+offset))
rm(offset)
rgl.clear()
rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1)
spheres3d(coord[,1], coord[,2], coord[,3], radius=0.03, color=coord[,4], alpha=1, shininess=20)
aspect3d(1, 1, 1)
axes3d(col='black')
title3d("", "", "", "", "", col='black')
bg3d("white") # To save a snapshot of the graph, one can use the command rgl.snapshot("test.png").
Clustering with Other Algorithms
ChemmineR allows the user to take advantage of the wide spectrum of
clustering utilities available in R. An example of how to perform
hierarchical clustering with the hclust function is given below.
Create atom pair distance matrix:
dummy <- cmp.cluster(db=apset, cutoff=0, save.distances="distmat.rda", quiet=TRUE)
##
## sorting result...
load("distmat.rda")
Hierarchical clustering with hclust:
hc <- hclust(as.dist(distmat), method="single")
hc[["labels"]] <- cid(apset) # Assign correct item labels
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=T)

Instead of atom pairs one can use PubChem’s fingerprints for clustering:
simMA <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE))
hc <- hclust(as.dist(1-simMA), method="single")
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
Plot dendrogram with heatmap (here similarity matrix):
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
heatmap.2(1-distmat, Rowv=as.dendrogram(hc), Colv=as.dendrogram(hc),
col=colorpanel(40, "darkblue", "yellow", "white"),
density.info="none", trace="none")
