GEN242 – Projects

Assignments: Overview of Course Projects

Mon, 01 Jan 0001 00:00:00 +0000

Introduction

During the tutorial sessions of this class all students will perform the basic data analysis of at least two NGS Workflows including RNA-Seq and VAR-Seq. In addition, every student will work on a Challenge Project addressing a specific data analysis task within one of the general NGS Workflows. Students will also present a scientific paper closely related to their challenge topic (see here). To facilitate teamwork and communication with instructors, each course project will be assigned a private GitHub repository.

The results of the Challenge Projects will be presented by each student during the last week of the course (see Slideshow Template here). In addition, each student will write a detailed analysis report for the assigned course project. This report needs to include all analysis steps of the corresponding NGS Workflow (e.g. full RNA-Seq analysis) as well as the code and results of the Challenge Project. The final project reports will be written in R Markdown. A basic tutorial on R Markdown is available here. Both the R Markdown script (.Rmd) and the rendered HTML or PDF report will be submitted to each student’s private project GitHub repository. All helper code used for the challenge project needs to be organized as well-documented R functions in each project’s *_Fct.R script. The custom functions defined in *_Fct.R need to be imported (sourced) and used in the main Rmd project report. Other scripts used by the challenge projects need to be called from the *_Fct.R script (e.g. via R’s system function) and also uploaded to the project repos. The expected structure of the final project report is outlined below.
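The *_Fct.R convention described above could look roughly like the following minimal sketch. The file name `rnaseq_Fct.R` and the function `countByLFC` are hypothetical and only serve as an illustration. The helper file is written to a temporary directory here so that the example is self-contained, whereas in a real project it would be a file in the project repository.

```r
## Minimal sketch of the *_Fct.R convention (all names are hypothetical).
## In a real project, 'rnaseq_Fct.R' would be a file in the project repository.
fct_file <- file.path(tempdir(), "rnaseq_Fct.R")
writeLines(c(
  "## Count genes passing an absolute log2 fold change cutoff",
  "## lfc: numeric vector of log2 fold changes",
  "## cutoff: absolute log2 fold change threshold",
  "## Returns the number of genes with |lfc| >= cutoff",
  "countByLFC <- function(lfc, cutoff = 1) {",
  "  sum(abs(lfc) >= cutoff, na.rm = TRUE)",
  "}"
), fct_file)

## In the main .Rmd report, a code chunk imports (sources) the helper functions
source(fct_file)

## ... and then uses them in the analysis
n_deg <- countByLFC(c(-2.3, 0.4, 1.0, NA, 3.1), cutoff = 1)
n_deg
```

Keeping the function definitions in one sourced file keeps the report itself short and makes the helper code easier to document and reuse.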

The reports should be submitted to each student’s private project GitHub repository. In this repository, each student should create a new directory named after their workflow project and include the following files in it:

.Rmd source script of project report
Report rendered from .Rmd source in HTML or PDF format
*_Fct.R file containing all helper functions written for the challenge project
Submission Deadline for reports: 6:00 PM, June 11th, 2024

Structure of final project report

Abstract
Introduction
Methods
- Short description of methods used by NGS workflow
- Detailed description of methods used for challenge project
Results and Discussion
- Includes all components of NGS workflow as well as challenge project
Conclusions
Acknowledgments
References
Supplement (optional)

Assignments: RNA-Seq - NGS Aligners

RNA-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Map reads against reference genome
3. Perform read counting for required ranges (e.g. exonic gene ranges)
4. Normalization of read counts
5. Identification of differentially expressed genes (DEGs)
6. Clustering of gene expression profiles
7. Gene set enrichment analysis

Challenge Project: Comparison of RNA-Seq Aligners

Run the above workflow from start to finish (steps 1-7) on the RNA-Seq data set from Howard et al. (2013).
Challenge project tasks
- Compare the RNA-Seq aligner HISAT2 with at least one or two other aligners, such as Rsubread, STAR or Kallisto. Evaluate the impact of the aligner on the downstream analysis results including:
  - Read counts
  - Differentially expressed genes (DEGs)
  - Generate plots that compare the results efficiently
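As a starting point for the read count comparison, the following sketch correlates the counts from two aligners sample by sample and plots them against each other. The count matrices are simulated and the sample names are hypothetical. A real project would load the count tables generated with each aligner instead.

```r
## Hypothetical count matrices (genes x samples) from two aligners;
## real projects would load the count tables produced with each aligner.
set.seed(42)
genes <- paste0("gene", 1:200)
samples <- c("M1A", "M1B", "A1A", "A1B")
base <- matrix(rnbinom(200 * 4, mu = 100, size = 2), nrow = 200,
               dimnames = list(genes, samples))
counts_hisat2 <- base
counts_other <- base + matrix(rpois(200 * 4, lambda = 5), nrow = 200)

## Per-sample Spearman correlation of the counts between the two aligners
cor_by_sample <- sapply(samples, function(s)
  cor(counts_hisat2[, s], counts_other[, s], method = "spearman"))
round(cor_by_sample, 3)

## Scatter plot of log-transformed counts for the first sample
plot(log2(counts_hisat2[, 1] + 1), log2(counts_other[, 1] + 1),
     xlab = "HISAT2 log2(count + 1)", ylab = "Other aligner log2(count + 1)",
     main = samples[1], pch = 20)
abline(0, 1, col = "red")
```

Genes that fall far from the diagonal are candidates for a closer look, for example genes with many multi-mapping reads.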

References

Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. doi: 10.1038/nbt.3519 PubMed
Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. PubMed
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. doi: 10.1186/gb-2013-14-4-r36 PubMed
Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12: 357–360 PubMed
Liao Y, Smyth GK, Shi W (2013) The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res 41: e108 PubMed

Assignments: RNA-Seq - DEG Analysis Methods

RNA-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Map reads against reference genome
3. Perform read counting for required ranges (e.g. exonic gene ranges)
4. Normalization of read counts
5. Identification of differentially expressed genes (DEGs)
6. Clustering of gene expression profiles
7. Gene set enrichment analysis

Challenge Projects

1. Comparison of DEG analysis methods

Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).
Challenge project tasks
- Compare the DEG analysis method chosen for the paper presentation with at least 1-2 additional methods (e.g. one student compares edgeR vs. baySeq, and the other student DESeq2 vs. limma/voom). Assess the results as follows:
  - Analyze the similarities and differences in the DEG lists obtained from the compared methods using intersect matrices, Venn diagrams and/or UpSet plots.
  - Assess the impact of the DEG method on the downstream gene set enrichment analysis.
  - Plot the performance of the DEG methods in the form of ROC curves and record their AUC values. A consensus DEG set or the one from the Howard et al. (2013) paper could be used as the ‘pseudo’ ground truth result.
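The comparison tasks above can be sketched in base R as follows. The DEG lists, gene IDs and scores are made up for illustration. The AUC is computed here from ranks (Mann-Whitney statistic), which is one of several possible ways to obtain it and is not necessarily the approach expected for the project. The function name `aucFromScores` is hypothetical.

```r
## Hypothetical DEG lists from three methods (gene IDs are made up)
deg_list <- list(
  edgeR  = c("g1", "g2", "g3", "g4", "g5"),
  DESeq2 = c("g2", "g3", "g4", "g6"),
  limma  = c("g1", "g3", "g4", "g7")
)

## Pairwise intersect matrix: number of DEGs shared by each pair of methods
intersect_mat <- sapply(deg_list, function(x)
  sapply(deg_list, function(y) length(intersect(x, y))))
intersect_mat

## 'Pseudo' ground truth: consensus of genes called by at least two methods
tab <- table(unlist(deg_list))
truth_set <- names(tab)[tab >= 2]

## AUC from ranks (Mann-Whitney statistic); larger scores are assumed to
## indicate stronger evidence for differential expression, e.g. -log10(FDR)
aucFromScores <- function(scores, is_true) {
  r <- rank(scores)
  n_pos <- sum(is_true)
  n_neg <- sum(!is_true)
  (sum(r[is_true]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
all_genes <- paste0("g", 1:10)
scores <- c(g1 = 5, g2 = 2.5, g3 = 9, g4 = 8, g5 = 1,
            g6 = 3, g7 = 2, g8 = 0.5, g9 = 0.2, g10 = 0.1)
auc <- aucFromScores(scores[all_genes], all_genes %in% truth_set)
auc
```

Note that a consensus-based truth set is not independent of the methods being tested, which should be kept in mind when interpreting the AUC values.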

2. Comparison of DEG analysis methods

Similar to the above, but with a different combination of DEG methods and/or performance testing approach.

References

Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. PubMed
Guo Y, Li C-I, Ye F, Shyr Y (2013) Evaluation of read count based RNAseq analysis methods. BMC Genomics 14 Suppl 8: S2 PubMed
Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422 PubMed
Liu R, Holik AZ, Su S, Jansz N, Chen K, Leong HS, Blewitt ME, Asselin-Labat M-L, Smyth GK, Ritchie ME (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res. doi: 10.1093/nar/gkv412. PubMed
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550 PubMed
Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res 42: e91 PubMed

Assignments: Cluster and Network Analysis Methods

RNA-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Map reads against reference genome
3. Perform read counting for required ranges (e.g. exonic gene ranges)
4. Normalization of read counts
5. Identification of differentially expressed genes (DEGs)
6. Clustering of gene expression profiles
7. Gene set enrichment analysis

Challenge Projects

1. Cluster and network analysis methods

Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).
Challenge project tasks
- Compare at least 2-3 cluster analysis methods (e.g. Clust, hierarchical, k-means, Fuzzy C-Means, WGCNA, other) and assess the performance differences as follows:
  - Analyze the similarities and differences in the cluster groupings obtained from the compared methods.
  - Do the differences affect the results of the downstream functional enrichment analysis?
  - Plot the performance of the clustering methods in the form of ROC curves and/or record their AUC values. Functional annotations (e.g. GO, KEGG, Pfam) could be used as ‘pseudo’ ground truth.
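One way to quantify how similar two cluster groupings are is sketched below, using hierarchical clustering and k-means from base R on simulated data. The expression matrix is simulated and the function `adjustedRand` is a hypothetical helper written for this example. It implements the adjusted Rand index, which is used here in place of the ROC/AUC evaluation because it needs no ground truth.

```r
## Simulated expression matrix (genes x samples) with three built-in gene groups;
## real projects would use normalized expression values, e.g. of the DEGs.
set.seed(1)
expr <- rbind(
  matrix(rnorm(30 * 6, mean = 0), ncol = 6),
  matrix(rnorm(30 * 6, mean = 4), ncol = 6),
  matrix(rnorm(30 * 6, mean = 8), ncol = 6)
)
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

## Method 1: hierarchical clustering cut into k = 3 clusters
hc <- hclust(dist(expr), method = "complete")
cl_hc <- cutree(hc, k = 3)

## Method 2: k-means with k = 3
cl_km <- kmeans(expr, centers = 3, nstart = 10)$cluster

## Cross-tabulation shows how the two groupings map onto each other
table(hierarchical = cl_hc, kmeans = cl_km)

## Adjusted Rand index (1 = identical partitions, about 0 = chance agreement)
adjustedRand <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) sum(choose(x, 2))
  expected <- comb2(rowSums(tab)) * comb2(colSums(tab)) / choose(length(a), 2)
  (comb2(tab) - expected) /
    (0.5 * (comb2(rowSums(tab)) + comb2(colSums(tab))) - expected)
}
ari <- adjustedRand(cl_hc, cl_km)
ari
```

The index ignores cluster labels, so two methods that produce the same groups under different cluster numbers still score 1.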

2. Cluster and network analysis methods

Similar to the above, but with a different combination of clustering methods and/or performance testing approach.

References

Abu-Jamous B, Kelly S (2018) Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol 19: 172 PubMed
Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. PubMed
Langfelder P, Luo R, Oldham MC, Horvath S (2011) Is my network module preserved and reproducible? PLoS Comput Biol 7: e1001057. PubMed
Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559–559. PubMed
Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa L da F, Rodrigues FA (2019) Clustering algorithms: A comparative approach. PLoS One 14: e0210236. PubMed

Assignments: RNA-Seq - Differentially Expressed Transcript (DET) Analysis

RNA-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Map reads against reference genome
3. Perform read counting for required ranges (e.g. exonic gene ranges)
4. Normalization of read counts
5. Identification of differentially expressed genes (DEGs)
6. Clustering of gene expression profiles
7. Gene set enrichment analysis

Challenge Projects

Analysis of Differentially Expressed Exons and Transcripts

Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).
Challenge project tasks
- Group 1: Perform differential exon analysis with DEXSeq. Assess the results as follows:
  - Identify genes that show differential exon usage according to DEXSeq. Optionally, perform functional gene set enrichment analysis on the obtained gene set.
  - Compare the results with the findings of the splice variant analysis reported by Howard et al. (2013).
  - Optional: compare the performance of DEXSeq and Kallisto/Sleuth (see below) with the results from the Howard et al. (2013) paper in the form of ROC plots. A consensus DET set or similar could be used as the ‘pseudo’ ground truth.
- Group 2: Same as above but with Kallisto/Sleuth (Pimentel et al., 2017) or DTUrtle (Tekath and Dugas, 2021).

References

Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22: 2008–2017 PubMed
Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. PubMed
Guo Y, Li C-I, Ye F, Shyr Y (2013) Evaluation of read count based RNAseq analysis methods. BMC Genomics 14 Suppl 8: S2 PubMed
Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422 PubMed
Liu R, Holik AZ, Su S, Jansz N, Chen K, Leong HS, Blewitt ME, Asselin-Labat M-L, Smyth GK, Ritchie ME (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res. doi: 10.1093/nar/gkv412. PubMed
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550 PubMed
Pimentel H, Bray NL, Puente S, Melsted P, Pachter L (2017) Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods 14: 687–690. PubMed
Tekath T, Dugas M (2021) Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtle. Bioinformatics 37: 3781–3787. PubMed
Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res 42: e91 PubMed

Assignments: Clustering and Embedding Methods for scRNA-Seq

RNA-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Map reads against reference genome
3. Perform read counting for required ranges (e.g. exonic gene ranges)
4. Normalization of read counts
5. Identification of differentially expressed genes (DEGs)
6. Clustering of gene expression profiles
7. Gene set enrichment analysis

Challenge Project

Clustering and Embedding Methods for scRNA-Seq

Run the above workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).
Challenge project tasks
- Groups 1 and 2 compare the partition performance of at least 3 clustering and 3 embedding methods, respectively, for high-dimensional gene expression data using single cell RNA-Seq data.
- The clustering methods can include SC3, TSCAN, Seurat, PCAkmeans, etc. (for additional methods, see Table 3 in Duò et al., 2018).
- The dimensionality reduction methods can include PCA, MDS, SC3, isomap, t-SNE, FIt-SNE, UMAP, runUMAP in the scater Bioconductor package, etc.
- To obtain meaningful test results, choose an scRNA-Seq data set (here pre-processed count data) where the correct cell clustering is known (ground truth). For simplicity, the data could be obtained from the scRNAseq package (Risso and Cole, 2020) or loaded from GEO (e.g. Shulse et al., 2019). For learning purposes, organize the data in a SingleCellExperiment object. The tutorial (here) of the scran package provides an excellent introduction to working with SingleCellExperiment objects and embedding methods like t-SNE.
- Optional: plot the (partitioning) performance in the form of ROC curves and/or record their AUC values.
- Compare your test results with published performance test results, e.g. Sun et al. (2019) or Duò et al. (2018).
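The evaluation idea in the tasks above (embed the cells, cluster them, score against known labels) can be sketched with base R alone. Everything below is an assumption made for illustration: the counts are simulated, PCA and classical MDS stand in for the embedding methods, k-means for the clustering step, and cluster purity (a hypothetical helper function `purity`) for the performance measure. A real project would use a SingleCellExperiment object and the methods listed above.

```r
## Simulated count matrix (genes x cells) with three known cell groups;
## a real project would use a published data set with annotated cell types.
set.seed(7)
n_genes <- 100
n_per_group <- 40
truth <- rep(c("A", "B", "C"), each = n_per_group)
mu <- cbind(A = rep(c(20, 5, 5), length.out = n_genes),
            B = rep(c(5, 20, 5), length.out = n_genes),
            C = rep(c(5, 5, 20), length.out = n_genes))
counts <- sapply(truth, function(g) rpois(n_genes, lambda = mu[, g]))
colnames(counts) <- paste0("cell", seq_along(truth))
logc <- log2(counts + 1)

## Two embeddings of the cells into two dimensions
emb_pca <- prcomp(t(logc))$x[, 1:2]                              # PCA
emb_mds <- cmdscale(dist(t(logc), method = "manhattan"), k = 2)  # classical MDS

## Purity: fraction of cells that belong to the majority label of their cluster
purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}

## Cluster the cells in each embedding and score against the known labels
res <- c(
  PCA = purity(kmeans(emb_pca, centers = 3, nstart = 20)$cluster, truth),
  MDS = purity(kmeans(emb_mds, centers = 3, nstart = 20)$cluster, truth)
)
res
```

Purity is easy to interpret but favors solutions with many clusters, so measures such as the adjusted Rand index are often reported alongside it.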

References

Duò A, Robinson MD, Soneson C (2018) A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res 7: 1141. PubMed
Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. PubMed
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14: 483–486. PubMed
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (Nov) : 2579-2605, 2008.
Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y (2019) Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods 16: 243–245 PubMed (Note: this could be used as a more recent pub on t-SNE; the speed improved version is also available for R with a C)
McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv
Risso D, Cole M (2020). scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets. R package version 2.4.0. -> Choose one scRNA-Seq data set from this Bioc data package for testing embedding methods. URL
Senabouth A, Lukowski SW, Hernandez JA, Andersen SB, Mei X, Nguyen QH, Powell JE (2019) ascend: R package for analysis of single-cell RNA-seq data. Gigascience. doi: 10.1093/gigascience/giz087. PubMed
Shulse CN, Cole BJ, Ciobanu D, Lin J, Yoshinaga Y, Gouran M, Turco GM, Zhu Y, O’Malley RC, Brady SM, et al (2019) High-Throughput Single-Cell Transcriptome Profiling of Plant Cell Types. Cell Rep 27: 2241–2247.e4 PubMed
Sun S, Zhu J, Ma Y, Zhou X (2019) Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 20: 269. PubMed
Sun S, Zhu J, Zhou X (2020) Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods. doi: 10.1038/s41592-019-0701-7. PubMed

Assignments: ChIP-Seq Peak Callers

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Projects

1. Comparison of peak calling methods

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Call peaks with at least 2-3 software tools, such as MACS2, slice coverage calling (Bioc), PeakSeq, F-Seq, Homer, ChIPseqR, or CSAR.
- Compare the results with peaks identified by Kaufmann et al. (2010)
- Report unique and common peaks among the methods and plot the results as Venn diagrams
- Plot the performance of the peak callers in the form of ROC plots. The intersect of the peaks identified by all methods can be used as the true result set.
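The counting of unique and common peaks can be illustrated with a small base R sketch. The peak coordinates are made up and `overlapsAnyPeak` is a hypothetical helper written for this example. In a real project the peaks would usually be stored as GRanges objects and compared with the overlap functions of the GenomicRanges package (e.g. findOverlaps), which scale much better.

```r
## Hypothetical peak sets from two peak callers (coordinates are made up)
peaks_a <- data.frame(chr = c("Chr1", "Chr1", "Chr2"),
                      start = c(100, 500, 200), end = c(200, 650, 300))
peaks_b <- data.frame(chr = c("Chr1", "Chr2", "Chr2"),
                      start = c(150, 250, 900), end = c(260, 320, 990))

## TRUE for each query peak that overlaps at least one subject peak
overlapsAnyPeak <- function(query, subject) {
  vapply(seq_len(nrow(query)), function(i) {
    any(subject$chr == query$chr[i] &
        subject$start <= query$end[i] &
        subject$end >= query$start[i])
  }, logical(1))
}

a_in_b <- overlapsAnyPeak(peaks_a, peaks_b)
b_in_a <- overlapsAnyPeak(peaks_b, peaks_a)

## Counts that could feed a two-set Venn diagram
venn_counts <- c(unique_a = sum(!a_in_b), common_a = sum(a_in_b),
                 unique_b = sum(!b_in_a), common_b = sum(b_in_a))
venn_counts
```

The common counts are reported from both sides because one wide peak in one set can overlap several narrow peaks in the other, so the two numbers need not agree.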

2. Comparison of peak calling methods

Similar to the above, but with a different combination of peak calling methods and/or performance testing approach.

References

Feng J, Liu T, Qin B, Zhang Y, Liu XS (2012) Identifying ChIP-seq enrichment using MACS. Nat Protoc 7: 1728–1740. PubMed
Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed
Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, et al (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22: 1813–1831. PubMed
Lun ATL, Smyth GK (2014) De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res 42: e95. PubMed
Muiño JM, Kaufmann K, van Ham RC, Angenent GC, Krajewski P (2011) ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions. Plant Methods 7: 11. PubMed
Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. doi: 10.1371/journal.pone.0011471. PubMed
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, et al (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol. doi: 10.1186/gb-2008-9-9-r137. PubMed

Assignments: Functional enrichment analysis (FEA)

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Project: Functional enrichment analysis (FEA)

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Perform functional enrichment analysis on the genes overlapping or downstream of the peak ranges discovered by the ChIP-Seq workflow.
- Compare at least 2 functional enrichment methods (e.g. GOCluster_Report, fgsea, chipenrich, goseq, GOstats) using KEGG/Reactome or Gene Ontology as functional annotation systems. Among the FEA methods include one based on the hypergeometric distribution (ORA) and one on the Gene Set Enrichment Analysis (GSEA) algorithm. Assess the results as follows:
  - Quantify the rank-based similarities of the functional categories among the chosen enrichment methods.
  - Determine whether the enrichment results match the biological expectations of the experiment (e.g. are certain biological processes enriched?).
  - Optional: visualize the results with one of the pathway or GO graph viewing tools.
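To make the ORA part and the rank-based comparison more concrete, here is a minimal base R sketch. The gene IDs, gene sets and the second p-value vector (standing in for a GSEA-type result) are all made up, and `oraPvalue` is a hypothetical helper. The packages listed above provide complete implementations, including multiple testing correction.

```r
## Hypothetical input: a gene universe, a test set (e.g. genes near peaks),
## and gene sets representing functional categories (all IDs are made up)
universe <- paste0("g", 1:100)
test_set <- paste0("g", 1:20)
gene_sets <- list(
  flower_dev = paste0("g", c(1:10, 51:55)),
  transport  = paste0("g", c(15:20, 60:80)),
  metabolism = paste0("g", c(18:20, 30:50))
)

## Over-representation analysis (ORA) with the hypergeometric distribution
oraPvalue <- function(set, test_set, universe) {
  set <- intersect(set, universe)
  x <- length(intersect(set, test_set))  # category genes in the test set
  m <- length(set)                       # category size
  n <- length(universe) - m              # genes outside the category
  k <- length(test_set)                  # test set size
  phyper(x - 1, m, n, k, lower.tail = FALSE)  # P(X >= x)
}
ora_p <- sapply(gene_sets, oraPvalue, test_set = test_set, universe = universe)
sort(ora_p)

## Rank-based similarity between two methods: Spearman correlation of the
## category p-values (the second vector stands in for a GSEA-type result)
gsea_p <- c(flower_dev = 0.001, transport = 0.20, metabolism = 0.60)
rank_sim <- cor(ora_p, gsea_p[names(ora_p)], method = "spearman")
rank_sim
```

With real results, the correlation would be computed over the categories reported by both methods, and the raw p-values would be adjusted for multiple testing before interpretation.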

References

Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed
Sergushichev A (2016) An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. bioRxiv 060012
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550. PubMed
Welch RP, Lee C, Imbriano PM, Patil S, Weymouth TE, Smith RA, Scott LJ, Sartor MA (2014) ChIP-Enrich: gene set enrichment testing for ChIP-seq data. Nucleic Acids Res 42: e105. PubMed
Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 11: R14. PubMed

Assignments: Motif Enrichment Analysis (MEA)

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Projects

1. Motif enrichment

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Prioritize/rank peaks by FDR from differential binding analysis
- Parse peak sequences from genome
- Determine which motifs in the JASPAR database (MotifDb) show the highest enrichment in the peak sequences. The motif enrichment tests can be performed with the PWMEnrich package. Basic starter code for accomplishing these tasks is provided here. The motif mapping can be performed with matchPWM or motifmatchr, and motif identification in databases can be performed with MotIV.
- To have distinct challenge project aspects for each of the two students in this project, one could use different peak ranking approaches, e.g. one ranks by FDR of differential binding analysis, and the other by coverage or p-values of peak caller.

2. Motif discovery

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Use the peaks discovered in the workflow (steps 1-7 above) for motif discovery
- Run discovery with at least two motif discovery tools (MEME-ChIP and BCRANK)
- Identify motifs that are found by at least two discovery tools
- Identify motifs that are most similar to those reported in the Kaufmann et al. (2010) paper
- Optional: compare with known motifs in the JASPAR database

References

Frith, Martin C., Yutao Fu, Liqun Yu, Jiang‐fan Chen, Ulla Hansen, and Zhiping Weng. 2004. “Detection of Functional DNA Motifs via Statistical Over‐representation.” Nucleic Acids Research 32 (4): 1372–81. PubMed
Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed
Machanick P, Bailey TL (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27: 1696–1697. PubMed
McLeay, Robert C, and Timothy L Bailey. 2010. “Motif Enrichment Analysis: A Unified Framework and an Evaluation on ChIP Data.” BMC Bioinformatics 11: 165. PubMed
Tompa, M, N Li, T L Bailey, G M Church, B De Moor, E Eskin, A V Favorov, et al. 2005. “Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites.” Nature Biotechnology 23 (1): 137–44. PubMed
Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33: 831–838. PubMed

Assignments: Drug-target analysis

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Project: Drug-target analysis of proteins encoded by genes in peak regions

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Identify protein coding genes in peak regions
- Identify corresponding human orthologs
- Perform drug-target annotation analysis, e.g. with drugTargetInteractions package
- Identify similar drugs with two different structural similarity search algorithms (e.g. 2 fingerprint methods)
- Challenge question: which of the two structural similarity search tools identifies more similar small molecules that have annotated protein targets in ChEMBL (DrugBank)? Explore options on how to visualize the performance results.
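Fingerprint-based similarity searches typically score compound pairs with the Tanimoto coefficient. The sketch below shows the calculation in base R on made-up binary fingerprints. The compound names, the fingerprints and the 0.5 cutoff are assumptions for illustration only. In a real project the fingerprints would be computed from compound structures with a cheminformatics package, once per fingerprint method to be compared.

```r
## Hypothetical binary fingerprints (1 = structural feature present);
## real fingerprints would be computed from the compound structures.
fp_query <- c(1, 1, 0, 1, 0, 0, 1, 0)
fp_db <- rbind(
  cmpA = c(1, 1, 0, 1, 0, 0, 1, 0),
  cmpB = c(1, 0, 0, 1, 0, 1, 1, 0),
  cmpC = c(0, 0, 1, 0, 1, 1, 0, 1)
)

## Tanimoto coefficient: shared features / features present in either compound
tanimoto <- function(a, b) {
  both <- sum(a == 1 & b == 1)
  either <- sum(a == 1 | b == 1)
  if (either == 0) return(NA_real_)
  both / either
}

## Rank the database compounds by similarity to the query
sim <- apply(fp_db, 1, tanimoto, b = fp_query)
sort(sim, decreasing = TRUE)

## Compounds passing a similarity cutoff (0.5 is an arbitrary choice)
hits <- names(sim)[sim >= 0.5]
hits
```

Running the same search with two fingerprint types yields two hit lists per query, which can then be compared by the number of hits that carry target annotations.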

References

Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42: 1407–1414 PubMed
Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed

Assignments: Genome Summary Graphics

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Project: Programmable graphics for visualizing genomic features

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- This project focuses on the visualization of patterns in NGS experiments (e.g. consensus motifs in ChIP-Seq peaks) to discover novel features in genomes. The visualization backend should be based on one of the programmable and extendable R/Bioconductor environments such as ggplot2 (ggplotly), ggbio, Gviz, RCircos, etc. For instance, this could include:
  - The generation of motif logos (e.g. for ChIP-Seq peaks) for any number of sequence ranges of interest.
  - Integration of the results with functional annotation information (e.g. protein families from Pfam, exonic regions coding for disordered structures), pathways and/or GO.
  - Incorporation of quantitative information such as relative or differential abundance information obtained from the corresponding NGS profiling technology.
  - If there is interest, a Shiny App could be included to run the developed R functions interactively from a web browser.

References

Hahne F, Ivanek R (2016). “Statistical Genomics: Methods and Protocols.” In Mathé E, Davis S (eds.), chapter Visualizing Genomic Data Using Gviz and Bioconductor, 335–351. Springer New York, New York, NY. ISBN 978-1-4939-3578-9, doi: 10.1007/978-1-4939-3578-9_16. PubMed
Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed
Yin T, Cook D, Lawrence M (2012). “ggbio: an R package for extending the grammar of graphics for genomic data.” Genome Biology, 13(8), R77. PubMed
Zhang H, Meltzer P, Davis S (2013) RCircos: an R package for Circos 2D track plots. BMC Bioinformatics 14: 244–244. PubMed

Assignments: lncRNAs and other features

ChIP-Seq Workflow

1. Read quality assessment, filtering and trimming
2. Align reads to reference genome
3. Compute read coverage across genome
4. Peak calling with different methods and consensus peak identification
5. Annotate peaks
6. Differential binding analysis
7. Gene set enrichment analysis
8. Motif prediction to identify putative TF binding sites

Challenge Project: lncRNAs and other features

Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)
Challenge project tasks
- Parse the DNA sequences of the identified peak footprints
- Identify 1-2 of the following feature types in these peak sequences:
  - Long non-coding RNAs (lncRNAs; Han et al., 2019; Hu et al., 2017)
  - Open reading frames (ORFs)
  - miRNAs
  - Repeats
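For the ORF feature type, a very simple detection approach can be sketched in base R with a regular expression. The function `findORFs` is a hypothetical helper and the sequence is made up. The sketch only scans the forward strand and reports non-overlapping matches, so it misses nested ORFs and ORFs on the reverse strand. Dedicated Bioconductor tools should be preferred for the actual project.

```r
## Find forward-strand ORFs (ATG followed by an in-frame stop codon).
## min_codons counts all codons including the start and stop codon.
findORFs <- function(dna, min_codons = 3) {
  dna <- toupper(dna)
  pattern <- "ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)"
  g <- gregexpr(pattern, dna, perl = TRUE)
  m <- g[[1]]
  if (m[1] == -1) {
    return(data.frame(start = integer(0), end = integer(0), seq = character(0)))
  }
  len <- attr(m, "match.length")
  orfs <- data.frame(start = as.integer(m),
                     end = as.integer(m) + len - 1L,
                     seq = regmatches(dna, g)[[1]])
  orfs[(len / 3) >= min_codons, , drop = FALSE]
}

## Made-up example sequence, not taken from any peak
dna <- "CCATGAAATAGGGATGCCCGGGTTTTGACC"
orfs <- findORFs(dna)
orfs
```

The lazy quantifier `*?` makes each match stop at the first in-frame stop codon, which is what defines the end of an ORF.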

References

Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. PubMed
Han S, Liang Y, Ma Q, Xu Y, Zhang Y, Du W, Wang C, Li Y (2019) LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief Bioinform 20: 2009–2027 PubMed
Hu L, Xu Z, Hu B, Lu ZJ (2017) COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res 45: e2 PubMed

Assignments: Project Data Management and Run Instructions

Big data space on HPCC

All larger data sets of the course projects will be organized in a big data space under /bigdata/gen242/<user_name>. Within this space, each student will work in a subdirectory named after their project:

/bigdata/gen242/<user_name>/<github_name>_project

Project GitHub repositories

Students will work on their course projects within GitHub repositories, one for each student. These project repositories are private and have been shared with each student. To populate a course project with an initial project workflow, please follow the instructions given in the following section.

Generate workflow environment with real project data

Log in to the HPCC cluster and set your working directory to your bigdata directory (/bigdata/gen242/<user_name>).
Clone the GitHub repository for your project with git clone ... (URLs are listed in Course Planning sheet) and then cd into this directory. As mentioned above, the project GitHub repos follow this naming convention: <github_name>_project.
Generate the workflow environment for your project on the HPCC cluster with genWorkenvir from systemPipeRdata.
Next, cd into the directory of your workflow, delete its default data and results directories, and then substitute them with empty directories outside of your project GitHub repos as follows (<workflow> needs to be replaced with actual workflow name):
```
mkdir ../../<workflow>_data
mkdir ../../<workflow>_results
```
Within your workflow directory, create symbolic links pointing to the new directories created in the previous step. For instance, the projects using the RNA-Seq workflow should create the symbolic links for their data and results directories like this (<user_name> and <workflow> need to be replaced with your user name and workflow name):
```
ln -s /bigdata/gen242/<user_name>/<workflow>_data data
ln -s /bigdata/gen242/<user_name>/<workflow>_results results
```
Add the workflow directory to the GitHub repository of your project with git add -A and then run commit and push as outlined in the GitHub instructions of this course here. After this, check whether the workflow directory and its content show up in your project’s online repos on GitHub. Very important: make sure that the data and results directories are empty at this point. If not, investigate why and fix the problem in the corresponding step above.
Download the FASTQ files of your project with getSRAfastq (see below) to the data directory of your project.
Generate a proper targets file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: SampleNames and Factor). More details about the structure of targets files are provided here. Ready-to-use targets files for the RNA-Seq, ChIP-Seq and VAR-Seq projects can be downloaded as tab-separated (TSV) files from here. Alternatively, one can download the corresponding Google Sheets with the read_sheet function from the googlesheets4 package (RNA-Seq GSheet and ChIP-Seq GSheet).
Inspect and adjust the .param files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.
Every time you start working on your project you cd into the directory of the repository and then run git pull to get the latest changes. When you are done, you commit and push your changes back to GitHub with git commit -am "some edits"; git push.
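For the targets file step above, the following sketch shows how such a table could be assembled and checked in R. The column names (FileName, SampleName, Factor), the FASTQ file names and the sample names are assumptions chosen for illustration. The ready-to-use targets files linked above and the targets file documentation remain the reference for the exact structure.

```r
## Sketch of building a targets table (all names below are hypothetical)
sraidv <- paste0("SRR4460", 27:30)  # subset of run IDs for illustration
targets <- data.frame(
  FileName   = file.path("./data", paste0(sraidv, ".fastq.gz")),
  SampleName = c("M1A", "M1B", "A1A", "A1B"),
  Factor     = c("M1", "M1", "A1", "A1")
)

## Write as tab-separated file and read it back to verify the structure
targets_file <- file.path(tempdir(), "targets.txt")
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)
targets_check <- read.delim(targets_file)
targets_check

## Basic sanity checks: unique sample names, and which FASTQ files exist on disk
stopifnot(!any(duplicated(targets_check$SampleName)))
file.exists(targets_check$FileName)
```

Checking that every listed FASTQ file exists before starting the workflow catches path problems early, for example a missing symbolic link to the data directory.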

Download of project data

After logging in to one of the compute nodes via srun, open R from within the GitHub repository of your project and then run the following code sections, but only those that apply to your project.

FASTQ files from SRA

Choose FASTQ data for your project

The FASTQ files for the ChIP-Seq project are from SRA study SRP002174 (Kaufman et al. 2010)

sraidv <- paste("SRR0388", 45:51, sep="")

The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)

sraidv <- paste("SRR4460", 27:44, sep="")

The FASTQ files for the VAR-Seq project are from SRA studies SRP008819 and SRP007172 (Lu et al. 2012). Work only with one of the two studies by using the corresponding targets file (see above).

sraidv <- c(paste("SRR1051", 389:415, sep=""), c("SRR352145", "SRR279136"))

Load libraries and modules

library(systemPipeR)                                                                                                                                                                
moduleload("sratoolkit/3.0.0")                                                                                                                                                      
system("vdb-config --prefetch-to-cwd") # sets download default to current directory                                                                                          
# system('prefetch --help') # helps to speed up fastq-dump
# system('vdb-config -i') # allows to change SRA Toolkit configuration; instructions are here: https://bit.ly/3lzfU4P
# system('fastq-dump --help') # below uses this one for backwards compatibility                                                                                                     
# system('fasterq-dump --help') # faster than fastq-dump

Define download function

The following function downloads and extracts the FASTQ files for each project from SRA. Internally, it uses the prefetch and fastq-dump utilities of the SRA Toolkit from NCBI. The faster fasterq-dump alternative (see comment line below) is not used here for historical reasons. Note, if you use the SRA Toolkit in your HPCC user account for the first time, it might ask you to configure it by running vdb-config --interactive from the command-line. In the resulting dialog, one can keep the default settings, and then save and exit. Running vdb-config --prefetch-to-cwd prior to any FASTQ file downloads sets the download location to the current working directory (see above).

getSRAfastq <- function(sraid, threads=1) {                                                                                                                                         
    system(paste("prefetch", sraid)) # makes download faster                                                                                                                        
    system(paste("vdb-validate", sraid)) # checks integrity of the downloaded SRA file                                                                   
    system(paste("fastq-dump --split-files --gzip", sraid)) # gzip option makes it slower but saves storage space                                                                   
    # system(paste("fasterq-dump --threads 4 --split-files --progress ", sraid, "--outdir .")) # Faster alternative to fastq-dump                                                   
    unlink(x=sraid, recursive = TRUE, force = TRUE) # deletes sra download directory                                                                                                
}

To stop the loop after a failure is detected by vdb-validate, use && operator like this: prefetch sraid && vdb-validate sraid && fastq-dump sraid.
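
One way to apply this from within R is to assemble the three commands into a single shell call, so that the shell skips the remaining commands as soon as one of them returns a non-zero exit status. The sketch below only builds and prints the command string. The helper name sra_cmd is hypothetical, and passing the result to system() assumes that the SRA Toolkit module has been loaded as shown above.

```r
# Hypothetical helper: chains the three SRA Toolkit calls with '&&' so that
# fastq-dump only runs if both prefetch and vdb-validate succeeded
sra_cmd <- function(sraid) {
    paste("prefetch", sraid, "&&",
          "vdb-validate", sraid, "&&",
          "fastq-dump --split-files --gzip", sraid)
}
cmd <- sra_cmd("SRR446027")
cat(cmd, "\n")
# status <- system(cmd) # a non-zero status indicates that one of the three steps failed
```

The returned exit status could then be used inside the download loop to stop or to record the accessions that failed.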

Run download

Note, the following performs the download in serialized mode for the chosen data set and saves the extracted FASTQ files to the data directory, which is temporarily set as the current working directory.

mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i)
setwd(mydir)
## Check whether all FASTQ files were downloaded
downloaded_files <- list.files('./data', pattern='fastq.gz$')
all(sraidv %in% gsub("_.*", "", downloaded_files)) # Should be TRUE
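
If the above check returns FALSE, it helps to know which accessions are missing. The following is a minimal sketch that uses made-up file names in place of the list.files result, so that it runs without any downloads.

```r
# Accessions requested for download (three RNA-Seq IDs used for illustration)
sraidv <- paste("SRR4460", 27:29, sep="")
# Made-up example of what list.files('./data', pattern='fastq.gz$') might return
downloaded_files <- c("SRR446027_1.fastq.gz", "SRR446027_2.fastq.gz", "SRR446029_1.fastq.gz")
# Strip everything after the first underscore and report accessions without any file
missing_ids <- setdiff(sraidv, sub("_.*", "", downloaded_files))
missing_ids
# Rerun the download only for the missing accessions (from within the data directory)
# for(i in missing_ids) getSRAfastq(sraid=i)
```

Note that this only detects accessions without any FASTQ file. It does not detect a missing second file of a paired-end sample.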

Alternatively, the download can be performed in parallelized mode with BiocParallel. Please run this version only on a compute node.

mydir <- getwd(); setwd("data")
# library(BiocParallel) # provides bplapply and MulticoreParam
# bplapply(sraidv, getSRAfastq, BPPARAM = MulticoreParam(workers=4))
setwd(mydir)

Avoid FASTQ download

To save time, skip the download of the FASTQ files. Instead generate in the data directory of your workflow symlinks to already downloaded FASTQ files.

fastq_symlink <- function(workflow) {
    file_paths <- list.files(file.path("/bigdata/gen242/data", workflow, "data"), pattern='fastq.gz$', full.names=TRUE)
    for(i in seq_along(file_paths)) system(paste0("ln -s ", file_paths[i], " ./data/", basename(file_paths[i])))
}
workflow_type <- "fastq_rnaseq" # Choose here correct workflow: 'fastq_rnaseq' or 'fastq_varseq'
fastq_symlink(workflow=workflow_type)

Download reference genome and annotation

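The following downloadRefs function downloads the Arabidopsis thaliana (TAIR10) genome sequence and GFF3 file from the Ensembl Genomes FTP site (plants release 59). Since both files come from the same release, the chromosome identifiers are expected to be the same among both the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow. In addition, the function generates a TxDb database from the GFF file and downloads a table with functional gene descriptions.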

downloadRefs <- function(rerun=FALSE) {
   if(rerun==TRUE) {
        library(Biostrings)
        download.file("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz", "./data/tair10.fasta.gz")
        R.utils::gunzip("./data/tair10.fasta.gz")
        download.file("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gff3.gz", "./data/tair10.gff.gz")
        R.utils::gunzip("./data/tair10.gff.gz")
        txdb <- GenomicFeatures::makeTxDbFromGFF(file = "data/tair10.gff", format = "gff", dataSource = "TAIR", organism = "Arabidopsis thaliana")
        AnnotationDbi::saveDb(txdb, file="./data/tair10.sqlite")
        download.file("https://cluster.hpcc.ucr.edu/~tgirke/Teaching/GEN242/data/tair10_functional_descriptions", "./data/tair10_functional_descriptions")
    }
}

After importing/sourcing the above function, execute it as follows:

downloadRefs(rerun=TRUE)
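
Whether the chromosome identifiers of the two downloaded files really agree can be verified directly. The following base R sketch compares the sequence names in the FASTA headers with those in the first column of the GFF file. The helper name check_seqnames is hypothetical. Reading the entire genome with readLines is acceptable for a genome of this size, but it is not memory efficient.

```r
# Hypothetical helper: returns sequence names present in only one of the two files
check_seqnames <- function(fasta, gff) {
    fasta_lines <- readLines(fasta)
    headers <- fasta_lines[startsWith(fasta_lines, ">")]
    # Keep only the identifier between '>' and the first whitespace
    fasta_ids <- sub("^>([^[:space:]]+).*", "\\1", headers)
    # Lines starting with '#' (GFF directives) are skipped via comment.char
    gff_tab <- read.delim(gff, header = FALSE, comment.char = "#",
                          quote = "", colClasses = "character")
    gff_ids <- unique(gff_tab[[1]])
    list(only_in_fasta = setdiff(fasta_ids, gff_ids),
         only_in_gff = setdiff(gff_ids, fasta_ids))
}
# check_seqnames("data/tair10.fasta", "data/tair10.gff") # both elements should be empty
```

Sequences without annotated features would show up under only_in_fasta, which is not necessarily a problem. Entries under only_in_gff indicate a naming mismatch that needs to be fixed.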

Workflow Rmd file

To run the actual data analysis workflows, each project can use the Rmd file obtained from the genWorkenvir(workflow='...') call directly. The RNA-Seq group might want to work with the sprnaseq.Rmd file used in the tutorial here.

Recommendations for running workflows

Run instructions

The following provides recommendations and additional options to consider for running and modifying workflows. This also includes parallelization settings for the specific data used by the class projects. Note, additional details can be found in this and other sections of the workflow introduction tutorial here. Importantly, the following should be run from within an srun session.

library(systemPipeR)                                                                                                                                                                
sal <- SPRproject() # when running a WF for first time                                                                                                                                      
sal                                                                                                                                                                                 
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd") # populates sal with WF steps defined in Rmd                                                                                                                      
sal
# sal <- SPRproject(resume=TRUE) # when restarting a WF, skip above steps and resume WF with this command                                                                                                                                               
getRversion() # should be 4.2.2. Note, R version can be changed with `module load ...`                                                                                                                                                     
system("hostname") # should return name of a compute node; if not close Nvim-R session, log in to a compute node with srun and then restart Nvim-R session
# sal <- runWF(sal) # runs WF serialized. Not recommended since this will take much longer than parallel mode introduced below by taking advantage of resource allocation
resources <- list(conffile=".batchtools.conf.R",                                                                                                                                    
                  template="batchtools.slurm.tmpl",                                                                                                                                 
                  Njobs=18, # chipseq should use here number of fastq files (7 or 8)                                                                                                                                                        
                  walltime=180, ## minutes                                                                                                                                          
                  ntasks=1,                                                                                                                                                         
                  ncpus=4,                                                                                                                                                          
                  memory=4096, ## Mb                                                                                                                                                
                  partition = "gen242",                                                                                                                                              
                  account = "gen242"
                  )                                                                                                                                                                 
## Note, some users might need to update the provided `batchtools.slurm.tmpl` file in their workflow directory by running the following download command: 
# download.file("https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/batchtools.slurm.tmpl", "batchtools.slurm.tmpl")
## Alternatively, changing above "gen242" to "epyc" under partition will also work.
## For RNA-Seq project use:
sal <- addResources(sal, step = c("preprocessing", "trimming", "hisat2_mapping"), resources = resources) # parallelizes time consuming computations assigned to `step` argument                                                                           
## For VAR-Seq project use this line instead:
# sal <- addResources(sal, step = c("preprocessing", "bwa_alignment"), resources = resources)
## For ChIP-Seq project use this line instead:
# sal <- addResources(sal, step = c("preprocessing", "bowtie2_alignment"), resources = resources)
## Alternative for VAR-Seq project that parallelizes only the alignment step:
# sal <- addResources(sal, c("bwa_alignment"), resources = resources)
sal <- runWF(sal) # runs entire workflow; specific steps can be executed by assigning their corresponding position numbers within the workflow to the `steps` argument (see ?runWF)                                                                                                                                                               
sal <- renderReport(sal) # after workflow has completed render Rmd to HTML report (default name is SPR_Report.html) and view it via web browser which requires symbolic link in your ~/.html folder. 
rmarkdown::render("systemPipeRNAseq.Rmd", clean=TRUE, output_format="BiocStyle::html_document") # Alternative approach for rendering report from Rmd file instead of sal object
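
Instead of hard-coding Njobs, the value can be derived from the number of rows in the targets file. The following sketch assumes one cluster job per targets row and a tab-delimited targets file with header lines starting with '#'. The two-sample file written here is made up for illustration.

```r
# Write a made-up targets file for illustration (real projects use their own file)
targets_file <- tempfile(fileext = ".txt")
writeLines(c(
    "# Project ID: example",
    "FileName1\tFileName2\tSampleName\tFactor",
    "./data/SRR446027_1.fastq.gz\t./data/SRR446027_2.fastq.gz\tM1A\tM1",
    "./data/SRR446028_1.fastq.gz\t./data/SRR446028_2.fastq.gz\tM1B\tM1"
), targets_file)
# Comment lines are skipped so that only the header and sample rows are imported
targets <- read.delim(targets_file, comment.char = "#")
n_samples <- nrow(targets)
n_samples
# resources$Njobs <- n_samples # assumes the 'resources' list defined above
```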

Modify a workflow

If needed one can modify existing workflow steps in a pre-populated SYSargsList object, and potentially already executed WF, with the replaceStep(sal) <- replacement function. The following gives an example where step number 3 in a SYSargsList (sal) object will be updated with modified or new code. Note, this is a generalized example where the user needs to insert the code lines and also adjust the values assigned to the arguments: step_name and dependency. Additional details on this topic are available in the corresponding section of systemPipeR’s introductory tutorial here.

replaceStep(sal, step=3) <- LineWise(                                                                                                                                                        
    code = {                                                                                                                                                                        
        << my modified code lines >>
        },                                                                                                                                                                          
    step_name = << "my_step_name" >>,                                                                                                                                                        
    dependency = << "my_dependency" >>)

Since step names need to be unique, one should avoid using the same step_name as before. If the previous name is used, a default name will be assigned. Rerunning the assignment will then allow assigning the previous name. This behavior is enforced for version tracking. Subsequently, one can view and check the code changes with codeLine(), and then rerun the corresponding step (here 3) as follows:

codeLine(stepsWF(sal)$my_step_name)
runWF(sal, steps=3)

Note, any step in a workflow can only be run in isolation if its expected input exists (see dependency).

Adding steps to a workflow

New steps can be added to the Rmd file of a workflow by inserting new R Markdown code chunks starting and ending with the usual appendStep<- syntax, and then creating a new SYSargsList instance with importWF that contains the new step(s). To add steps to a pre-populated SYSargsList object, one can use the after argument of the appendStep<- function. The following example will add a new step after position 3 to the corresponding sal object. This can be useful if a longer workflow has already been completed and one only wants to make some refinements without re-running the entire workflow.

appendStep(sal, after=3) <- << my_step_code >>

Submit workflow from command-line to cluster

In addition to running workflows within interactive R sessions, after logging in to a compute node with srun, one can execute them entirely from the command-line by including the relevant workflow run instructions in an R script. The R script can then be submitted via a Slurm submission script to the cluster. The following gives an example for the RNA-Seq workflow (ChIP-Seq version requires only minor adjustments). Additional details on this topic are available in the corresponding section of systemPipeR’s introductory tutorial here.

R script: wf_run_script.R
Slurm submission script: wf_run_script.sh

To test this out, users can generate in their user account of the cluster a workflow environment populated with the toy data as outlined here. After this, it is best to create within the workflow directory a subdirectory, e.g. called cl_sbatch_run, and then save the above two files (*.R and *.sh) to this subdirectory. Next, the parameters in both files need to be adjusted to match the type of workflow and the required computing resources. This includes the name of the Rmd file and scheduler resource settings such as: partition, Njobs, walltime, memory, etc. After all relevant settings have been set correctly, one can execute the workflow with sbatch within the cl_sbatch_run directory as follows:

sbatch wf_run_script.sh

Note, some users might need to replace in the root directory of their workflow the default batchtools.slurm.tmpl file with this updated version here. Alternatively, changing in the wf_run_script.R “gen242” to “epyc” under the partition argument will also work. After the submission to the cluster, one should usually check its status and progress with squeue -u <username> as well as by monitoring the content of the slurm-<jobid>.out file generated by the scheduler in the same directory. This file contains most of the STDOUT and STDERR generated by a cluster job. Once everything is working on the toy data, users can run the workflow on the real data the same way.

Detailed step-by-step instructions for running the workflows from the command-line are provided in this wf_run_from_cl.R script.

Assignments: Compare Performance of Variant Callers

VAR-Seq Workflow

1. Read preprocessing: filtering, quality trimming
2. Alignments
3. Alignment statistics
4. Variant calling: focus of challenge project
5. Variant filtering
6. Variant annotation
7. Combine results from many samples
8. Summary statistics of samples

Challenge Project: Performance Comparisons of Variant Callers

Run the workflow from start to finish (steps 1-8) on the VAR-Seq data set from Lu et al (2012).
Challenge project tasks
- Compare the performance of at least 2 variant callers, e.g. GATK, BCFtools, Octopus and DeepVariant. Include in your comparisons the following analysis/visualization steps (Barbitoff et al 2022; Cooke et al 2021; Li 2011; Poplin et al 2018).
  1. Report unique and common variants identified by tested variant callers.
  2. Compare the results from (1) with the variants identified by Lu et al, 2012.
  3. Plot results from 1.-2. as Venn diagrams or similar (e.g. UpSet plots).
  4. If there is enough time and interest, plot the performance of the variant callers in the form of ROC plots and calculate AUC values. As pseudo ground truth, one can either use the published variants or the union of the variants identified by all methods.
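
The comparisons in the tasks above largely reduce to set operations on variant identifiers plus a ranking-based performance measure. The following base R sketch uses made-up variant identifiers (chromosome_position_ref/alt) and made-up quality scores. In a real analysis these would be derived from the VCF files of each caller. The AUC is computed here with the rank-sum (Mann-Whitney) formula rather than with a dedicated ROC package, which is a choice made for this illustration only.

```r
# Made-up variant identifiers per caller (real ones would come from VCF files)
calls <- list(
    GATK     = c("Chr1_100_A/T", "Chr1_250_G/C", "Chr2_300_T/G"),
    BCFtools = c("Chr1_100_A/T", "Chr2_300_T/G", "Chr3_50_C/A")
)
# Made-up published variants used as pseudo ground truth
published <- c("Chr1_100_A/T", "Chr2_300_T/G", "Chr4_10_G/A")

common      <- Reduce(intersect, calls)             # identified by all callers
unique_gatk <- setdiff(calls$GATK, calls$BCFtools)  # identified only by GATK
unique_bcf  <- setdiff(calls$BCFtools, calls$GATK)  # identified only by BCFtools
confirmed   <- lapply(calls, intersect, published)  # overlap with published set

# AUC from ranks: probability that a true variant receives a higher score
# than a false one (ties are handled by average ranks)
auc_rank <- function(score, is_true) {
    r <- rank(score)
    n_pos <- sum(is_true)
    n_neg <- sum(!is_true)
    (sum(r[is_true]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
score   <- c(90, 20, 75) # made-up quality scores of the three GATK calls
is_true <- calls$GATK %in% published
auc_rank(score, is_true)
```

The list calls can hold more than two callers, in which case Reduce(intersect, calls) still returns the variants shared by all of them. The same list is a suitable input format for most Venn and UpSet plotting functions.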

References

Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV (2022) Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 23: 155.
Cooke DP, Wedge DC, Lunter G (2021) A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol 39: 885–892.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491–498.
Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993.
Lu P, Han X, Qi J, Yang J, Wijeratne AJ, Li T, Ma H (2012) Analysis of Arabidopsis genome-wide variations before and after meiosis and meiotic recombination by resequencing Landsberg erecta and all four products of a single meiosis. Genome Res 22: 508–518.
Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, et al (2018) A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36: 983–987.

Assignments:

Read mapping with `STAR`

library(systemPipeR)

# sal_test <- SPRproject(logs.dir= ".SPRproject_test") # use this line when .SPRproject_test doesn't exist yet
sal_test <- SPRproject(overwrite = TRUE, logs.dir= ".SPRproject_test")

## Creating directory:  /home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/data 
## Creating directory:  /home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/results 
## Creating directory '/home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/.SPRproject_test'
## Creating file '/home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/.SPRproject_test/SYSargsList.yml'

appendStep(sal_test) <- LineWise(code = {
                library(systemPipeR)
                }, step_name = "load_SPR")

Read preprocessing

Preprocessing with `preprocessReads` function

appendStep(sal_test) <- SYSargsList(
    step_name = "preprocessing",
    targets = "targetsPE.txt", dir = TRUE,
    wf_file = "preprocessReads/preprocessReads-pe.cwl",
    input_file = "preprocessReads/preprocessReads-pe.yml",
    dir_path = system.file("extdata/cwl", package = "systemPipeR"),
    inputvars = c(
        FileName1 = "_FASTQ_PATH1_",
        FileName2 = "_FASTQ_PATH2_",
        SampleName = "_SampleName_"
    ),
    dependency = c("load_SPR"))

Alignments with `STAR`

`STAR` Indexing

appendStep(sal_test) <- SYSargsList(
    step_name = "star_index", 
    dir = FALSE, 
    targets=NULL, 
    wf_file = "star/star-index.cwl", 
    input_file="star/star-index.yml",
    dir_path="param/cwl", 
    dependency = "load_SPR"
)

`STAR` mapping

appendStep(sal_test) <- SYSargsList(
    step_name = "star_mapping",
    dir = TRUE, 
    targets ="preprocessing", 
    wf_file = "star-mapping-pe.cwl",
    input_file = "star-mapping-pe.yml",
    dir_path = "param/star_test",
    inputvars = c(preprocessReads_1 = "_FASTQ_PATH1_", preprocessReads_2 = "_FASTQ_PATH2_", 
                  SampleName = "_SampleName_"),
    rm_targets_col = c("FileName1", "FileName2"), 
    dependency = c("preprocessing", "star_index")
)

## Return command-line calls for STAR
cmdlist(sal_test, step="star_mapping", targets=1)

## BAM outpaths required for read counting below
outpaths <- getColumn(sal_test, step = "star_mapping", "outfiles", column = "Aligned_toTranscriptome_out_bam")
file.exists(outpaths) # Will not return TRUE until STAR completed successfully

## To run sal_test stepwise, make sure you have constructed your 
## sal_test object step-by-step starting from an empty sal_test
## as shown above under chunk: intialize_sal_for_testing 
sal_test <- runWF(sal_test, steps=c(1)) # increment step number one by one just for checking
sal_test
outpaths <- getColumn(sal_test, step = "star_mapping", "outfiles", column = "Aligned_toTranscriptome_out_bam")
outpaths
file.exists(outpaths) # Will not return TRUE until STAR completed successfully

## The following can be used for setting things up for initial testing
starPE <- loadWorkflow(targets = "targetsPE.txt", wf_file = "star-mapping-pe.cwl", 
                       input_file = "star-mapping-pe.yml", dir_path = "./param/star_test")
starPE <- renderWF(starPE, inputvars = c(FileName1 = "_FASTQ_PATH1_", FileName2 = "_FASTQ_PATH2_", 
                                         SampleName = "_SampleName_"))
cmdlist(starPE)
runCommandline(starPE, make_bam = FALSE)