The sdfStream
function allows to stream through SD
Files with millions of molecules without consuming much memory. During
this process any set of descriptors, supported by
ChemmineR
, can be computed (e.g. atom pairs,
molecular properties, etc.), as long as they can be returned in tabular
format. In addition to descriptor values, the function returns a line
index that gives the start and end positions of each molecule in the
source SD File. This line index can be used by the downstream
read.SDFindex
function to retrieve specific molecules
of interest from the source SD File without reading the entire file into
R. The following outlines the typical workflow of this streaming
functionality in ChemmineR
.
Create sample SD File with 100 molecules:
write.SDF(sdfset, "test.sdf")
Define descriptor set in a simple function:
desc <- function(sdfset)
cbind(SDFID=sdfid(sdfset),
# datablock2ma(datablocklist=datablock(sdfset)),
MW=MW(sdfset),
groups(sdfset), APFP=desc2fp(x=sdf2ap(sdfset), descnames=1024,
type="character"), AP=sdf2ap(sdfset, type="character"), rings(sdfset,
type="count", upper=6, arom=TRUE) )
Run sdfStream
with desc
function and
write results to a file called matrix.xls
:
sdfStream(input="test.sdf", output="matrix.xls", fct=desc, Nlines=1000) # 'Nlines': number of lines to read from input SD File at a time
One can also start reading from a specific line number in the SD file.
The following example starts at line number 950. This is useful for
restarting and debugging the process. With append=TRUE
the result can be appended to an existing file.
sdfStream(input="test.sdf", output="matrix2.xls", append=FALSE, fct=desc, Nlines=1000, startline=950)
Select molecules meeting certain property criteria from SD File using
line index generated by previous sdfStream
step:
indexDF <- read.delim("matrix.xls", row.names=1)[,1:4]
indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400
sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset") # Collects results in 'SDFset' container
Write results directly to SD file without storing larger numbers of
molecules in memory:
read.SDFindex(file="test.sdf", index=indexDFsub, type="file",
outfile="sub.sdf")
Read AP/APFP strings from file into APset
or
FP
object:
apset <- read.AP(x="matrix.xls", type="ap", colid="AP")
apfp <- read.AP(x="matrix.xls", type="fp", colid="APFP")
Alternatively, one can provide the AP/APFP strings in a named character
vector:
apset <- read.AP(x=sdf2ap(sdfset[1:20], type="character"), type="ap")
fpchar <- desc2fp(sdf2ap(sdfset[1:20]), descnames=1024, type="character")
fpset <- as(fpchar, "FPset")