Maximum Common Substructure (MCS) Searching

The ChemmineR add-on package fmcsR provides support for identifying maximum common substructures (MCSs) and flexible MCSs among compounds. The algorithm can be used for pairwise compound comparisons, structure similarity searching and clustering. The manual describing this functionality is available here and the associated publication is Wang et al. (2013). The following gives a short preview of some functionalities provided by the fmcsR package.

 library(fmcsR)
 data(fmcstest) # Loads test sdfset object 
 test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) # Searches for MCS with mismatches 
 plotMCS(test) # Plots both query compounds with MCS in color 

AP/APset Classes for Storing Atom Pair Descriptors

The function sdf2ap computes atom pair descriptors for one or many compounds (Raymond E. Carhart et al., 1985; X. Chen , 2002). It returns a searchable atom pair database stored in a container of class APset, which can be used for structural similarity searching and clustering. As similarity measure, the Tanimoto coefficient or related coefficients can be used. An APset object consists of one or many AP entries each storing the atom pairs of a single compound. Note: the deprecated cmp.parse function is still available which also generates atom pair descriptor databases, but directly from an SD file. Since the latter function is less flexible it may be discontinued in the future.

Generate atom pair descriptor database for searching:

 ap <- sdf2ap(sdfset[[1]]) # For single compound
 ap 
## An instance of "AP"
## <<atom pairs>>
## 52614450304 52615497856 52615514112 52616547456 52616554624 ... length: 528
 apset <- sdf2ap(sdfset)
 # For many compounds. 
view(apset[1:4]) 
## $`650001`
## An instance of "AP"
## <<atom pairs>>
## 53688190976 53688190977 53688190978 53688190979 53688190980 ... length: 528 
## 
## $`650002`
## An instance of "AP"
## <<atom pairs>>
## 53688190976 53688190977 53688190978 53688190979 53689239552 ... length: 325 
## 
## $`650003`
## An instance of "AP"
## <<atom pairs>>
## 52615496704 53688190976 53688190977 53689239552 53697627136 ... length: 325 
## 
## $`650004`
## An instance of "AP"
## <<atom pairs>>
## 52617593856 52618642432 52619691008 52619691009 52628079616 ... length: 496

Return main components of APset objects:

 cid(apset[1:4]) # Compound IDs 
 ap(apset[1:4]) # Atom pair
 descriptors 
 db.explain(apset[1]) # Return atom pairs in human readable format 

Coerce APset to other objects:

 apset2descdb(apset) # Returns old list-style AP database 
 tmp <- as(apset, "list") # Returns list 
 as(tmp, "APset") # Converts list back to APset 

Large SDF and Atom Pair Databases

When working with large data sets it is often desirable to save the SDFset and APset containers as binary R objects to files for later use. This way they can be loaded very quickly into a new R session without recreating them every time from scratch.

Save and load of SDFset and APset containers:

 save(sdfset, file = "sdfset.rda", compress = TRUE) 
 load("sdfset.rda") save(apset, file = "apset.rda", compress = TRUE)
 load("apset.rda") 

Pairwise Compound Comparisons with Atom Pairs

The cmp.similarity function computes the atom pair similarity between two compounds using the Tanimoto coefficient as similarity measure. The coefficient is defined as c/(a+b+c), which is the proportion of the atom pairs shared among two compounds divided by their union. The variable c is the number of atom pairs common in both compounds, while a and b are the numbers of their unique atom pairs.

 cmp.similarity(apset[1],
 apset[2])
## [1] 0.2637037
 cmp.similarity(apset[1], apset[1]) 
## [1] 1

Similarity Searching with Atom Pairs

The cmp.search function searches an atom pair database for compounds that are similar to a query compound. The following example returns a data frame where the rows are sorted by the Tanimoto similarity score (best to worst). The first column contains the indices of the matching compounds in the database. The argument cutoff can be a similarity cutoff, meaning only compounds with a similarity value larger than this cutoff will be returned; or it can be an integer value restricting how many compounds will be returned. When supplying a cutoff of 0, the function will return the similarity values for every compound in the database.

 cmp.search(apset,
 apset["650065"], type=3, cutoff = 0.3, quiet=TRUE) 
##   index    cid    scores
## 1    61 650066 1.0000000
## 2    60 650065 1.0000000
## 3    67 650072 0.3389831
## 4    11 650011 0.3190608
## 5    15 650015 0.3184524
## 6    86 650092 0.3154270
## 7    64 650069 0.3010279

Alternatively, the function can return the matches in form of an index or a named vector if the type argument is set to 1 or 2, respectively.

 cmp.search(apset, apset["650065"], type=1, cutoff = 0.3, quiet=TRUE)
## [1] 61 60 67 11 15 86 64
 cmp.search(apset, apset["650065"], type=2, cutoff = 0.3, quiet=TRUE) 
##    650066    650065    650072    650011    650015    650092    650069 
## 1.0000000 1.0000000 0.3389831 0.3190608 0.3184524 0.3154270 0.3010279

FP/FPset Classes for Storing Fingerprints

The FPset class stores fingerprints of small molecules in a matrix-like representation where every molecule is encoded as a fingerprint of the same type and length. The FPset container acts as a searchable database that contains the fingerprints of many molecules. The FP container holds only one fingerprint. Several constructor and coerce methods are provided to populate FP/FPset containers with fingerprints, while supporting any type and length of fingerprints. For instance, the function desc2fp generates fingerprints from an atom pair database stored in an APset, and as(matrix, "FPset") and as(character, "FPset") construct an FPset database from objects where the fingerprints are represented as matrix or character objects, respectively.

Show slots of FPset class:

 showClass("FPset") 
## Class "FPset" [package "ChemmineR"]
## 
## Slots:
##                                     
## Name:       fpma      type foldCount
## Class:    matrix character   numeric

Instance of FPset class:

 data(apset) 
 fpset <- desc2fp(apset)
 view(fpset[1:2]) 
## $`650001`
## An instance of "FP" of type "unknown-4721"
## <<fingerprint>>
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... length: 1024 
## 
## $`650002`
## An instance of "FP" of type "unknown-4173"
## <<fingerprint>>
## 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 ... length: 1024

FPset class usage:

 fpset[1:4] # behaves like a list 
## An instance of a 1024 bit "FPset" of type "apfp" with 4 molecules
 fpset[[1]] # returns FP object 
## An instance of "FP" of type "unknown-2986"
## <<fingerprint>>
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... length: 1024
 length(fpset) # number of compounds ENDCOMMENT
## [1] 100
 cid(fpset) # returns compound ids 
##   [1] "650001" "650002" "650003" "650004" "650005" "650006" "650007" "650008" "650009" "650010"
##  [11] "650011" "650012" "650013" "650014" "650015" "650016" "650017" "650019" "650020" "650021"
##  [21] "650022" "650023" "650024" "650025" "650026" "650027" "650028" "650029" "650030" "650031"
##  [31] "650032" "650033" "650034" "650035" "650036" "650037" "650038" "650039" "650040" "650041"
##  [41] "650042" "650043" "650044" "650045" "650046" "650047" "650048" "650049" "650050" "650052"
##  [51] "650054" "650056" "650058" "650059" "650060" "650061" "650062" "650063" "650064" "650065"
##  [61] "650066" "650067" "650068" "650069" "650070" "650071" "650072" "650073" "650074" "650075"
##  [71] "650076" "650077" "650078" "650079" "650080" "650081" "650082" "650083" "650085" "650086"
##  [81] "650087" "650088" "650089" "650090" "650091" "650092" "650093" "650094" "650095" "650096"
##  [91] "650097" "650098" "650099" "650100" "650101" "650102" "650103" "650104" "650105" "650106"
 fpset[10] <- 0 # replacement of 10th fingerprint to all zeros 
 cid(fpset) <- 1:length(fpset) # replaces compound ids 
 c(fpset[1:4], fpset[11:14]) # concatenation of several FPset objects 
## An instance of a 1024 bit "FPset" of type "apfp" with 8 molecules

Construct FPset class form matrix:

 fpma <- as.matrix(fpset) # coerces FPset to matrix 
 as(fpma, "FPset") 
## An instance of a 1024 bit "FPset" of type "unknown-6312" with 100 molecules

Construct FPset class form character vector:

 fpchar <- as.character(fpset) # coerces FPset to character strings 
 as(fpchar, "FPset") # construction of FPset class from character vector
## An instance of a 1024 bit "FPset" of type "apfp" with 100 molecules

Compound similarity searching with FPset:

 fpSim(fpset[1], fpset, method="Tanimoto", cutoff=0.4, top=4) 
##         1        96        67        15 
## 1.0000000 0.4719101 0.4288499 0.4275229

Folding fingerprints:

 fold(fpset) # fold each FP once
## An instance of a 512 bit "FPset" of type "apfp" with 100 molecules
 fold(fpset, count=2) #fold each FP twice
## An instance of a 256 bit "FPset" of type "apfp" with 100 molecules
 fold(fpset, bits=128) #fold each FP down to 128 bits
## An instance of a 128 bit "FPset" of type "apfp" with 100 molecules
 fold(fpset[[1]])  # fold an individual FP
## An instance of "FP" of type "unknown-2996"
## <<fingerprint>>
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... length: 512
 fptype(fpset) # get type of FPs
## [1] "apfp"
 numBits(fpset) # get the number of bits of each FP
## [1] 1024
 foldCount(fold(fpset)) # the number of times an FP or FPset has been folded
## [1] 1

Atom Pair Fingerprints

Atom pairs can be converted into binary atom pair fingerprints of fixed length. Computations on this compact data structure are more time and memory efficient than on their relatively complex atom pair counterparts. The function desc2fp generates fingerprints from descriptor vectors of variable length such as atom pairs stored in APset or list containers. The obtained fingerprints can be used for structure similarity comparisons, searching and clustering.

Create atom pair sample data set:

 data(sdfsample)
 sdfset <- sdfsample[1:10] 
 apset <- sdf2ap(sdfset) 

Compute atom pair fingerprint database using internal atom pair selection containing the 4096 most common atom pairs identified in DrugBank’s compound collection. For details see ?apfp. The following example uses from this set the 1024 most frequent atom pairs:

 fpset <- desc2fp(apset, descnames=1024, type="FPset") 

Alternatively, one can provide any custom atom pair selection. Here, the 1024 most common ones in apset:

 fpset1024 <- names(rev(sort(table(unlist(as(apset, "list")))))[1:1024]) 
 fpset <- desc2fp(apset, descnames=fpset1024, type="FPset") 

A more compact way of storing fingerprints is as character values:

 fpchar <- desc2fp(x=apset,
 descnames=1024, type="character") fpchar <- as.character(fpset) 

Converting a fingerprint database to a matrix and vice versa:

 fpma <- as.matrix(fpset)
 fpset <- as(fpma, "FPset") 

Similarity searching and returning Tanimoto similarity coefficients:

 fpSim(fpset[1], fpset, method="Tanimoto") 

Under method one can choose from several predefined similarity measures including Tanimoto (default), Euclidean, Tversky or Dice. Alternatively, one can pass on custom similarity functions.

 fpSim(fpset[1], fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1) 

Example for using a custom similarity function:

 myfct <- function(a, b, c, d) c/(a+b+c+d)
 fpSim(fpset[1], fpset, method=myfct) 

Clustering example:

 simMAap <- sapply(cid(apfpset), function(x) fpSim(x=apfpset[x], apfpset, sorted=FALSE)) 
 hc <- hclust(as.dist(1-simMAap), method="single")
 plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE) 

Fingerprint E-values

The fpSim function can also return Z-scores, E-values, and p-values if given a set of score distribution parameters. These parameters can be computed over an fpSet with the genParameters function.

    params <- genParameters(fpset)

This function will compute all pairwise distances between the given fingerprints and then fit a Beta distribution to the resulting Tanimoto scores, conditioned on the number of set bits in each fingerprint. For large data sets where you would not want to compute all pairwise distances, you can set what fraction to sample with the sampleFraction argument. This step only needs to be done once for each database of fpSet objects. Alternatively, if you have a large database of fingerprints, or you believe that the parameters computed on a single database are more generally applicable, you can use the resulting parameters for other databases as well.

Once you have a set of parameters, you can pass them to fpSim with the parameters argument.

	fpSim(fpset[[1]], fpset, top=10, parameters=params)	
##    similarity    zscore    evalue    pvalue
## 1   1.0000000 6.2418215  0.000000 0.0000000
## 96  0.4719101 1.6075792  6.748413 0.9988273
## 67  0.4288499 1.2297052 12.012285 0.9999939
## 15  0.4275229 1.2180604 12.211967 0.9999950
## 88  0.4247423 1.1936587 12.638193 0.9999968
## 64  0.4187380 1.1409688 13.594938 0.9999988
## 4   0.4166667 1.1227914 13.936692 0.9999991
## 86  0.3978686 0.9578290 17.319191 1.0000000
## 77  0.3970588 0.9507232 17.476453 1.0000000
## 69  0.3940000 0.9238806 18.079243 1.0000000

This will then return a data frame with the similarity, Z-score, E-value, and p-value. You can change which value will be used as a cutoff and to sort by by setting the argument scoreType to one of these scores. In this way you could set an E-value cutoff of 0.04 for example.

	fpSim(fpset[[1]], fpset, cutoff=0.04, scoreType="evalue", parameters=params)	
##   similarity   zscore evalue pvalue
## 1          1 6.241822      0      0

Pairwise Compound Comparisons with PubChem Fingerprints

The fpSim function computes the similarity coefficients (e.g. Tanimoto) for pairwise comparisons of binary fingerprints. For this data type, c is the number of “on-bits” common in both compounds, and a and b are the numbers of their unique “on-bits”. Currently, the PubChem fingerprints need to be provided (here PubChem’s SD files) and cannot be computed from scratch in ChemmineR. The PubChem fingerprint specifications can be loaded with data(pubchemFPencoding).

Convert base 64 encoded PubChem fingerprints to character vector, matrix or FPset object:

 cid(sdfset) <- sdfid(sdfset)
 fpset <- fp2bit(sdfset, type=1) 
 fpset <- fp2bit(sdfset, type=2) 
 fpset <- fp2bit(sdfset, type=3) 
 fpset 
## An instance of a 881 bit "FPset" of type "pubchem" with 100 molecules

Pairwise compound structure comparisons:

 fpSim(fpset[1], fpset[2]) 
##    650002 
## 0.5364807

Similarity Searching with PubChem Fingerprints

Similarly, the fpSim function provides search functionality for PubChem fingerprints:

 fpSim(fpset["650065"], fpset, method="Tanimoto", cutoff=0.6, top=6) 
##    650065    650066    650035    650019    650012    650046 
## 1.0000000 0.9944751 0.7435897 0.7432432 0.7230047 0.7142857

Visualize Similarity Search Results

The cmp.search function allows to visualize the chemical structures for the search results. Similar but more flexible chemical structure rendering functions are plot and sdf.visualize described above. By setting the visualize argument in cmp.search to TRUE, the matching compounds and their scores can be visualized with a standard web browser. Depending on the visualize.browse argument, an URL will be printed or a webpage will be opened showing the structures of the matching compounds.

View similarity search results in R’s graphics device:

 cid(sdfset) <-
 cid(apset) # Assure compound name consistency among objects. 

 plot(sdfset[names(cmp.search(apset, apset["650065"], type=2, cutoff=4, quiet=TRUE))], print=FALSE) 

View results online with Chemmine Tools:

 similarities <- cmp.search(apset, apset[1], type=3, cutoff = 10)
 sdf.visualize(sdfset[similarities[,1]])