Girke Lab Site
My research focuses on the development of computational data analysis methods
for genome biology and small molecule discovery. This includes
discovery-oriented data mining projects, as well as algorithm and software
development projects for data types from a variety of high-throughput
technologies such as next generation sequencing (NGS), genome-wide profiling
approaches and chemical genomics. As part of the multidisciplinary nature of my
field, I frequently collaborate with experimental scientists on data analysis
projects of complex biological networks. Another important activity is the
development of integrated data analysis systems for the open source software
projects R and Bioconductor. The following gives a short summary of a few
selected projects in my group.
systemPipeR: NGS workflow and report generation environment
systemPipeR is an
R/Bioconductor package for building and running automated analysis workflows
for a wide range of next generation sequencing (NGS) applications. Important
features include a uniform workflow interface across different NGS
applications, automated report generation, and support for running both R and
command-line software, such as NGS aligners or peak/variant callers, on local
computers or compute clusters. Efficient handling of complex sample sets and
experimental designs is facilitated by a consistently implemented sample
annotation infrastructure.
Figure 1: Workflow design structure of systemPipeR.
Reference-Assisted Transcriptome Assembly
Owing to the complexity and often incomplete representation of transcripts in
RNA-Seq libraries, the assembly of high-quality transcriptomes can be extremely
challenging. To address this, my group is developing
algorithms that guide these assemblies with genomic sequences of related organisms and
reduce the complexity of NGS libraries. The software tools we have published for this
purpose so far include SEED (Bao et al., 2011)
and BRANCH (Bao et al., 2013). BRANCH
is a reference-assisted post-processing method for enhancing de novo
transcriptome assemblies (Figure 2). It can be used in combination with most de novo
transcriptome assembly software tools. The assembly improvements are achieved
with the help of partial or complete genomic sequence information. This
information can be obtained by sequencing and assembling a genomic DNA sample
in addition to the RNA samples required for a transcriptome assembly project.
This approach is practical because it requires only preliminary genome assembly
results in the form of contigs, which can now be generated at reasonable cost
and time investment. If the genome sequence of a closely related organism is
available, one can skip the genome assembly step and use the related gene
sequences instead. This type of reference-assisted assembly approach provides
many attractive opportunities for improving de novo NGS assemblies in the
future by making use of the rapidly growing amount of reference genome
information available to us.
Figure 2: Outline of the *BRANCH* algorithm published in Bao et al., 2013. (a) Read alignments against preassembled transcripts and a closely related genomic reference. (b) Junction graph based on this alignment result. (c) Assembly of extended transcripts.
Modeling Gene Expression Networks from RNA-Seq and ChIP-Seq Data
As part of several collaborative research projects, my group has developed a
variety of data analysis pipelines for profiling data from next generation
sequencing projects (e.g. RNA-Seq and ChIP-Seq), microarray experiments and
high-throughput small molecule screens. Most of the data analysis resources
developed by these projects are described in the associated online manuals for
next generation data analysis.
Recent research publications of these projects include:
Yang et al., 2013;
Zou et al., 2013;
Yadav et al., 2013;
Yadav et al., 2011;
Mustroph et al., 2009.
Software Resources for Small Molecule Discovery and Chemical Genomics
Software tools for modeling the similarities among drug-like small molecules
and high-throughput screening data are important for many applications in drug
discovery and chemical genomics. In this area we are working on the development
of the ChemmineR
environment (Cao et al., 2008;
Backman et al., 2011). This modular
software infrastructure currently consists of five R/Bioconductor packages along with a
user-friendly web interface, named ChemMine Tools, that
is intended for non-expert users (Figures 3-4). The integration of cheminformatic
tools with the R programming environment has many advantages for small molecule discovery, such as easy access to a wide spectrum
of statistical methods, machine learning algorithms and graphic utilities.
Currently, the ChemmineR toolkit
provides utilities for processing large numbers of molecules,
physicochemical/structural property predictions, structural similarity
searching, classification and clustering of compound libraries and screening
results with a wide spectrum of algorithms. More recently, we have developed
fmcsR for this infrastructure, the first mismatch-tolerant
maximum common substructure search tool in
the field (Wang et al., 2013).
In our comparisons with related structure similarity search tools, fmcsR
showed the best virtual screening (VS) performance.
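To make the idea of structural similarity searching more concrete, the following is a minimal sketch in Python rather than R. It is not the ChemmineR or fmcsR API. It assumes that each compound has already been reduced to a set of structural feature identifiers (for example fingerprint bits), and it ranks a small, made-up library against a query using the Tanimoto coefficient. The function names `tanimoto` and `similarity_search`, the compound names and the feature sets are all hypothetical and chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets.

    Defined as |A intersect B| / |A union B|; returns 0.0 when both sets are empty.
    """
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def similarity_search(query, library, cutoff=0.0):
    """Rank library compounds by Tanimoto similarity to the query.

    `library` maps a compound name to its feature set. Compounds scoring
    below `cutoff` are dropped; the rest are sorted from most to least similar.
    """
    hits = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    hits = [hit for hit in hits if hit[1] >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical feature sets standing in for real structural descriptors.
library = {
    "cmp1": {1, 2, 3, 4},
    "cmp2": {1, 2, 5},
    "cmp3": {6, 7},
}
query = {1, 2, 3}

ranked = similarity_search(query, library, cutoff=0.1)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

In this toy example the compound sharing the most features with the query ranks first, and the compound with no shared features falls below the cutoff. A substructure-based tool would derive the overlap from the common substructure of two molecules instead of from precomputed feature sets, but the ranking step follows the same pattern.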
Figure 3: ChemmineR small molecule modeling environment with its add-on packages and selected functionalities.
Figure 4: Selectivity Analysis with ChemmineR and bioassayR
Functional Annotation of Gene and Protein Sequences
Computational methods for characterizing the functions of protein sequences
play an important role in the discovery of novel molecular, biochemical and
regulatory activities. To facilitate this process, we have developed the
sub-HMM algorithm that extends the application spectrum of profile HMMs to
motif discovery and active site prediction in protein sequences (Horan et al.,
2010). Its most interesting
utility is the identification of the functionally relevant residues in proteins
of known and unknown function (Figure 5). Additionally, sub-HMMs can be used
for highly localized sequence similarity searches that focus on shorter
conserved features rather than entire domains or global similarities. As part
of this study, we have predicted a comprehensive set of putative active sites
for all protein families available in the Pfam database, providing a
valuable knowledge resource for characterizing protein functions in the future.
Figure 5: Illustration of the sub-HMM extraction process from conserved protein domains, here Pfam desaturase domain (PF00487).