<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>GEN242 – Projects</title>
    <link>/assignments/projects/</link>
    <description>Recent content in Projects on GEN242</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/assignments/projects/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Assignments: Overview of Course Projects</title>
      <link>/assignments/projects/project_overview/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/project_overview/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;During the tutorial sessions of this class all students will perform the basic
data analysis of at least two NGS Workflows including RNA-Seq and VAR-Seq.
In addition, every student will work on a Challenge
Project addressing a specific data analysis task within one of the general NGS
Workflows. Students will also present a scientific paper closely related to
their challenge topic (see
&lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/assignments/presentations/paper_presentations/&#34;&gt;here&lt;/a&gt;).
To facilitate teamwork and communication with instructors, each course project will be
assigned a private GitHub repository.&lt;/p&gt;
&lt;p&gt;The results of the Challenge Projects will be presented by each student
during the last week of the course (see Slideshow Template
&lt;a href=&#34;https://bit.ly/3oMz9gb&#34;&gt;here&lt;/a&gt;).
In addition, each student will write a detailed analysis report for the assigned
course project. This report needs to include all analysis steps of the
corresponding NGS Workflow (&lt;em&gt;e.g.&lt;/em&gt; full RNA-Seq analysis) as well as the
code and results of the Challenge Project. The final project reports will be written
in R Markdown. A basic tutorial on R Markdown is available &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rmarkdown/rmarkdown/&#34;&gt;here&lt;/a&gt;.
Both the R Markdown script (&lt;code&gt;.Rmd&lt;/code&gt;) and the rendered HTML or PDF report will
be submitted to each student&amp;rsquo;s private project GitHub repository. All helper code used for
the challenge project needs to be organized as well-documented R functions in each
project&amp;rsquo;s &lt;code&gt;*_Fct.R&lt;/code&gt; script. The custom functions defined in &lt;code&gt;*_Fct.R&lt;/code&gt; need to be imported (sourced)
and used in the main Rmd project report. Other scripts used by the challenge projects need to be called from the &lt;code&gt;*_Fct.R&lt;/code&gt; script (&lt;em&gt;e.g.&lt;/em&gt; via R&amp;rsquo;s &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rprogramming/rprogramming/#calling-external-software&#34;&gt;system function&lt;/a&gt;) and also uploaded to the project repositories. The expected structure of the final project report is outlined below.&lt;/p&gt;
&lt;p&gt;The reports should be submitted to each student’s private project GitHub repository. For
the report each student should create in this repository a new directory named after their
workflow project and include in it the following files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.Rmd&lt;/code&gt; source script of project report&lt;/li&gt;
&lt;li&gt;Report rendered from &lt;code&gt;.Rmd&lt;/code&gt; source in HTML or PDF format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;*_Fct.R&lt;/code&gt; file containing all helper functions written for the challenge project&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Submission Deadline&lt;/strong&gt; for reports: 6:00 PM, June 11th, 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure-of-final-project-report&#34;&gt;Structure of final project report&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Abstract&lt;/li&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Methods
&lt;ul&gt;
&lt;li&gt;Short description of methods used by NGS workflow&lt;/li&gt;
&lt;li&gt;Detailed description of methods used for challenge project&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Results and Discussion
&lt;ul&gt;
&lt;li&gt;Includes all components of NGS workflow as well as challenge project&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Conclusions&lt;/li&gt;
&lt;li&gt;Acknowledgments&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;li&gt;Supplement (optional)&lt;/li&gt;
&lt;/ol&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: RNA-Seq - NGS Aligners</title>
      <link>/assignments/projects/01_rnaseq_aligners/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/01_rnaseq_aligners/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;rna-seq-workflow&#34;&gt;RNA-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Map reads against reference genome&lt;/li&gt;
&lt;li&gt;Perform read counting for required ranges (&lt;em&gt;e.g.&lt;/em&gt; exonic gene ranges)&lt;/li&gt;
&lt;li&gt;Normalization of read counts&lt;/li&gt;
&lt;li&gt;Identification of differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Clustering of gene expression profiles&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-comparison-of-rna-seq-aligners&#34;&gt;Challenge Project: Comparison of RNA-Seq Aligners&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run the above workflow from start to finish (steps 1-7) on the RNA-Seq data set from Howard et al. (2013).&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Compare the RNA-Seq aligner HISAT2 with at least 1-2 other aligners, such as Rsubread, STAR or Kallisto. Evaluate the impact of the aligner on the downstream analysis results including:
&lt;ul&gt;
&lt;li&gt;Read counts&lt;/li&gt;
&lt;li&gt;Differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Generate plots that compare the results efficiently&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. doi: 10.1038/nbt.3519 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/27043002&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. doi: 10.1186/gb-2013-14-4-r36 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/23618408&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12: 357–360 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/25751142&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liao Y, Smyth GK, Shi W (2013) The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res 41: e108 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/23558742&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: RNA-Seq - DEG Analysis Methods</title>
      <link>/assignments/projects/02_rnaseq_deg/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/02_rnaseq_deg/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;rna-seq-workflow&#34;&gt;RNA-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Map reads against reference genome&lt;/li&gt;
&lt;li&gt;Perform read counting for required ranges (&lt;em&gt;e.g.&lt;/em&gt; exonic gene ranges)&lt;/li&gt;
&lt;li&gt;Normalization of read counts&lt;/li&gt;
&lt;li&gt;Identification of differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Clustering of gene expression profiles&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-projects&#34;&gt;Challenge Projects&lt;/h2&gt;
&lt;h3 id=&#34;1-comparison-of-deg-analysis-methods&#34;&gt;1. Comparison of DEG analysis methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Compare the DEG analysis method chosen for the paper presentation with at least 1-2 additional methods (&lt;em&gt;e.g.&lt;/em&gt; one student compares &lt;em&gt;edgeR&lt;/em&gt; &lt;em&gt;vs.&lt;/em&gt; &lt;em&gt;baySeq&lt;/em&gt;, and the other student &lt;em&gt;DESeq2&lt;/em&gt; &lt;em&gt;vs.&lt;/em&gt; &lt;em&gt;limma/voom&lt;/em&gt;). Assess the results as follows:
&lt;ul&gt;
&lt;li&gt;Analyze the similarities and differences in the DEG lists obtained from the two methods using intersect matrices, Venn diagrams and/or upset plots.&lt;/li&gt;
&lt;li&gt;Assess the impact of the DEG method on the downstream gene set enrichment analysis.&lt;/li&gt;
&lt;li&gt;Plot the performance of the DEG methods in the form of ROC curves and record their AUC values. A consensus DEG set or the one from the Howard et al. (2013) paper could be used as the ‘pseudo’ ground truth result.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
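&lt;p&gt;The DEG list comparison described above can be sketched as a pairwise intersect matrix. This is a minimal illustration only, written in Python for brevity although the project reports themselves are expected in R. The function name &lt;code&gt;intersect_matrix&lt;/code&gt; and the gene identifiers are hypothetical and not part of the course material.&lt;/p&gt;

```python
from itertools import product


def intersect_matrix(deg_sets):
    """Count shared genes for every ordered pair of named DEG sets."""
    names = sorted(deg_sets)
    return {
        (a, b): len(deg_sets[a].intersection(deg_sets[b]))
        for a, b in product(names, repeat=2)
    }


# Hypothetical DEG lists; the identifiers are made up for illustration.
deg_sets = {
    "edgeR": {"AT1G01010", "AT1G01020", "AT1G01030"},
    "DESeq2": {"AT1G01020", "AT1G01030", "AT1G01040"},
}
matrix = intersect_matrix(deg_sets)
for pair in sorted(matrix):
    print(pair, matrix[pair])
```

&lt;p&gt;The diagonal entries give the size of each DEG set and the off-diagonal entries give the number of shared genes, which is the same information a Venn diagram or upset plot displays graphically.&lt;/p&gt;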
&lt;h3 id=&#34;2-comparison-of-deg-analysis-methods&#34;&gt;2. Comparison of DEG analysis methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Similar to the above, but with a different combination of DEG methods and/or performance testing approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Guo Y, Li C-I, Ye F, Shyr Y (2013) Evaluation of read count based RNAseq analysis methods. BMC Genomics 14 Suppl 8: S2 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24564449&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422 &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20698981/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu R, Holik AZ, Su S, Jansz N, Chen K, Leong HS, Blewitt ME, Asselin-Labat M-L, Smyth GK, Ritchie ME (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res. doi: 10.1093/nar/gkv412. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/25925576/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/25516281&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res 42: e91 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24753412&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Cluster and Network Analysis Methods</title>
      <link>/assignments/projects/03_cluster_analysis/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/03_cluster_analysis/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;rna-seq-workflow&#34;&gt;RNA-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Map reads against reference genome&lt;/li&gt;
&lt;li&gt;Perform read counting for required ranges (&lt;em&gt;e.g.&lt;/em&gt; exonic gene ranges)&lt;/li&gt;
&lt;li&gt;Normalization of read counts&lt;/li&gt;
&lt;li&gt;Identification of differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Clustering of gene expression profiles&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-projects&#34;&gt;Challenge Projects&lt;/h2&gt;
&lt;h3 id=&#34;1-cluster-and-network-analysis-methods&#34;&gt;1. Cluster and network analysis methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Compare at least 2-3 cluster analysis methods (e.g. Clust, hierarchical, k-means, Fuzzy C-Means, WGCNA, other) and assess the performance differences as follows:
&lt;ul&gt;
&lt;li&gt;Analyze the similarities and differences in the cluster groupings obtained from the chosen methods.&lt;/li&gt;
&lt;li&gt;Do the differences affect the results of the downstream functional enrichment analysis?&lt;/li&gt;
&lt;li&gt;Plot the performance of the clustering methods in the form of ROC curves and/or record their AUC values. Functional annotations (e.g. GO, KEGG, Pfam) could be used as ‘pseudo’ ground truth.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-cluster-and-network-analysis-methods&#34;&gt;2. Cluster and network analysis methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Similar to the above, but with a different combination of clustering methods and/or performance testing approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Abu-Jamous B, Kelly S (2018) Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol 19: 172 &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/30359297/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Langfelder P, Luo R, Oldham MC, Horvath S (2011) Is my network module preserved and reproducible? PLoS Comput Biol 7: e1001057. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/21283776/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9: 559–559. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/19114008/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa L da F, Rodrigues FA (2019) Clustering algorithms: A comparative approach. PLoS One 14: e0210236. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/30645617/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: RNA-Seq - Differentially Expressed Transcript (DET) Analysis</title>
      <link>/assignments/projects/02_rnaseq_dex/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/02_rnaseq_dex/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;rna-seq-workflow&#34;&gt;RNA-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Map reads against reference genome&lt;/li&gt;
&lt;li&gt;Perform read counting for required ranges (&lt;em&gt;e.g.&lt;/em&gt; exonic gene ranges)&lt;/li&gt;
&lt;li&gt;Normalization of read counts&lt;/li&gt;
&lt;li&gt;Identification of differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Clustering of gene expression profiles&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-projects&#34;&gt;Challenge Projects&lt;/h2&gt;
&lt;h3 id=&#34;analysis-of-differentially-expressed-exons-and-transcripts&#34;&gt;Analysis of Differentially Expressed Exons and Transcripts&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run the workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Group 1: Perform differential exon analysis with &lt;a href=&#34;https://bioconductor.org/packages/release/bioc/html/DEXSeq.html&#34;&gt;DEXSeq&lt;/a&gt;. Assess the results as follows:
&lt;ul&gt;
&lt;li&gt;Identify genes that show differential exon usage according to DEXSeq. Optionally, perform functional gene set enrichment analysis on the obtained gene set.&lt;/li&gt;
&lt;li&gt;Compare the results with the findings of the splice variant analysis reported by Howard et al. (2013).&lt;/li&gt;
&lt;li&gt;Optional: compare the performance of DEXSeq and Kallisto/Sleuth (see below) with the results from the Howard et al. (2013) paper in the form of ROC plots. As &amp;lsquo;pseudo&amp;rsquo; ground truth the consensus DET set or similar could be used.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Group 2: Same as above but with &lt;a href=&#34;https://pachterlab.github.io/kallisto/download.html&#34;&gt;Kallisto&lt;/a&gt;/&lt;a href=&#34;https://github.com/pachterlab/sleuth&#34;&gt;Sleuth&lt;/a&gt; (Pimentel et al., 2017) or &lt;a href=&#34;https://tobitekath.github.io/DTUrtle/&#34;&gt;DTUrtle&lt;/a&gt; (Tekath and Dugas, 2021).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22: 2008–2017 &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/22722343/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Guo Y, Li C-I, Ye F, Shyr Y (2013) Evaluation of read count based RNAseq analysis methods. BMC Genomics 14 Suppl 8: S2 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24564449&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422 &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20698981/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Liu R, Holik AZ, Su S, Jansz N, Chen K, Leong HS, Blewitt ME, Asselin-Labat M-L, Smyth GK, Ritchie ME (2015) Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res. doi: 10.1093/nar/gkv412. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/25925576/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/25516281&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Pimentel H, Bray NL, Puente S, Melsted P, Pachter L (2017) Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods 14: 687–690. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/28581496/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tekath T, Dugas M (2021) Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtle. Bioinformatics 37: 3781–3787. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/34469510/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhou X, Lindsay H, Robinson MD (2014) Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res 42: e91 &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24753412&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Clustering and Embedding Methods for scRNA-Seq</title>
      <link>/assignments/projects/04_scrnaseq_embedding/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/04_scrnaseq_embedding/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;rna-seq-workflow&#34;&gt;RNA-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Map reads against reference genome&lt;/li&gt;
&lt;li&gt;Perform read counting for required ranges (&lt;em&gt;e.g.&lt;/em&gt; exonic gene ranges)&lt;/li&gt;
&lt;li&gt;Normalization of read counts&lt;/li&gt;
&lt;li&gt;Identification of differentially expressed genes (DEGs)&lt;/li&gt;
&lt;li&gt;Clustering of gene expression profiles&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project&#34;&gt;Challenge Project&lt;/h2&gt;
&lt;h3 id=&#34;clustering-and-embedding-methods-for-scrna-seq&#34;&gt;Clustering and Embedding Methods for scRNA-Seq&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run the above workflow from start to finish (steps 1-7) on the full RNA-Seq data set from Howard et al. (2013).&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Groups 1 and 2 compare the partition performance of at least 3 clustering and 3 embedding methods, respectively, for high-dimensional gene expression data using single cell RNA-Seq data.&lt;/li&gt;
&lt;li&gt;The clustering methods can include SC3, TSCAN, Seurat, PCAkmeans, etc. (for additional methods, see table 3 in Duò et al., 2018).&lt;/li&gt;
&lt;li&gt;The dimensionality reduction methods can include PCA, MDS, &lt;a href=&#34;http://bioconductor.org/packages/release/bioc/html/SC3.html&#34;&gt;SC3&lt;/a&gt;, &lt;a href=&#34;https://bioconductor.org/packages/release/bioc/html/RDRToolbox.html&#34;&gt;isomap&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/web/packages/Rtsne/&#34;&gt;t-SNE&lt;/a&gt;, &lt;a href=&#34;https://github.com/KlugerLab/FIt-SNE&#34;&gt;FIt-SNE&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/web/packages/umap/index.html&#34;&gt;UMAP&lt;/a&gt;, &lt;a href=&#34;https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html&#34;&gt;runUMAP in scater Bioc package&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;To obtain meaningful test results, choose an scRNA-Seq data set (here pre-processed count data) where the correct cell clustering is known (ground truth). For simplicity the data could be obtained from the &lt;a href=&#34;https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html&#34;&gt;scRNAseq&lt;/a&gt; package (Risso and Cole, 2020) or loaded from GEO (e.g. Shulse et al., 2019). For learning purposes, organize the data in a &lt;a href=&#34;https://bioconductor.org/packages/3.12/bioc/html/SingleCellExperiment.html&#34;&gt;SingleCellExperiment&lt;/a&gt; object. The tutorial of the scran package (&lt;a href=&#34;https://bioconductor.org/packages/3.12/bioc/vignettes/scran/inst/doc/scran.html&#34;&gt;here&lt;/a&gt;) provides an excellent introduction to working with &lt;code&gt;SingleCellExperiment&lt;/code&gt; objects and embedding methods like t-SNE.&lt;/li&gt;
&lt;li&gt;Optional: plot the (partitioning) performance in the form of ROC curves and/or record their AUC values.&lt;/li&gt;
&lt;li&gt;Compare your test results with published performance test results, e.g. Sun et al. (2019) or Duò et al. (2018).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Duò A, Robinson MD, Soneson C (2018) A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res 7: 1141. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/30271584/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Howard, B.E. et al., 2013. High-throughput RNA sequencing of pseudomonas-infected Arabidopsis reveals hidden transcriptome complexity and novel splice variants. PloS one, 8(10), p.e74183. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14: 483–486. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/28346451/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (Nov) : 2579-2605, 2008.&lt;/li&gt;
&lt;li&gt;Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y (2019) Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods 16: 243–245 &lt;a href=&#34;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6402590/&#34;&gt;PubMed&lt;/a&gt; (Note: this could be used as a more recent pub on t-SNE; the speed improved version is also available for R with a C)&lt;/li&gt;
&lt;li&gt;McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. &lt;a href=&#34;https://arxiv.org/abs/1802.03426&#34;&gt;arXiv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Risso D, Cole M (2020). scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets. R package version 2.4.0. -&amp;gt; Choose one scRNA-Seq data set from this Bioc data package for testing embedding methods. &lt;a href=&#34;https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html&#34;&gt;URL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Senabouth A, Lukowski SW, Hernandez JA, Andersen SB, Mei X, Nguyen QH, Powell JE (2019) ascend: R package for analysis of single-cell RNA-seq data. Gigascience. doi: 10.1093/gigascience/giz087. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/31505654/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Shulse CN, Cole BJ, Ciobanu D, Lin J, Yoshinaga Y, Gouran M, Turco GM, Zhu Y, O’Malley RC, Brady SM, et al (2019) High-Throughput Single-Cell Transcriptome Profiling of Plant Cell Types. Cell Rep 27: 2241–2247.e4 &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/31091459/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sun S, Zhu J, Ma Y, Zhou X (2019) Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol 20: 269. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/31823809/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sun S, Zhu J, Zhou X (2020) Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods. doi: 10.1038/s41592-019-0701-7. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/31988518/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: ChIP-Seq Peak Callers</title>
      <link>/assignments/projects/05_chipseq_peakcaller/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/05_chipseq_peakcaller/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-projects&#34;&gt;Challenge Projects&lt;/h2&gt;
&lt;h3 id=&#34;1-comparison-of-peak-calling-methods&#34;&gt;1. Comparison of peak calling methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run the workflow from start to finish (steps 1-8) on the ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Call peaks with at least 2-3 software tools, such as MACS2, &lt;code&gt;slice&lt;/code&gt; coverage calling (Bioc), PeakSeq, F-Seq, Homer, ChIPseqR, or CSAR.&lt;/li&gt;
&lt;li&gt;Compare the results with the peaks identified by Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Report unique and common peaks among the chosen methods and plot the results as Venn diagrams&lt;/li&gt;
&lt;li&gt;Plot the performance of the peak callers in the form of ROC plots. As the true result set, one can use the intersect of the peaks identified by all methods.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
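&lt;p&gt;The ROC-based performance comparison described above can be summarized by the area under the curve (AUC). The following minimal sketch is written in Python purely for illustration, although the project itself is expected in R. It assumes each peak has a score from the peak caller and a 0/1 label stating whether it belongs to the pseudo ground truth set, and that no two scores are tied. The function name &lt;code&gt;auc_from_scores&lt;/code&gt; and all values are hypothetical.&lt;/p&gt;

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Assumes no tied scores; labels are 1 (in the pseudo ground truth) or 0.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Rank 1 is the lowest score, rank n the highest.
    rank_sum_pos = sum(
        rank for rank, i in enumerate(order, start=1) if labels[i] == 1
    )
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical peak scores and consensus membership, made up for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_from_scores(scores, labels)
print(round(auc, 3))
```

&lt;p&gt;An AUC of 1 means every true peak scores higher than every other peak, while 0.5 corresponds to a random ordering. Real peak scores often contain ties, which this sketch does not handle.&lt;/p&gt;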
&lt;h3 id=&#34;2-comparison-of-peak-calling-methods&#34;&gt;2. Comparison of peak calling methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Similar to the above, but with a different combination of peak calling methods and/or performance testing approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Feng J, Liu T, Qin B, Zhang Y, Liu XS (2012) Identifying ChIP-seq enrichment using MACS. Nat Protoc 7: 1728–1740. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/22936215/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, et al (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22: 1813–1831. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/22955991/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lun ATL, Smyth GK (2014) De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res 42: e95. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/24852250/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Muiño JM, Kaufmann K, van Ham RC, Angenent GC, Krajewski P (2011) ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions. Plant Methods 7: 11. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/21554688/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. doi: 10.1371/journal.pone.0011471. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20628599/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, et al (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol. doi: 10.1186/gb-2008-9-9-r137. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/18798982/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Functional enrichment analysis (FEA)</title>
      <link>/assignments/projects/06_functional_enrichment/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/06_functional_enrichment/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-functional-enrichment-analysis-fea&#34;&gt;Challenge Project: Functional enrichment analysis (FEA)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Perform functional enrichment analysis on the genes overlapping or downstream of the peak ranges discovered by the ChIP-Seq workflow.&lt;/li&gt;
&lt;li&gt;Compare at least 2 functional enrichment methods (&lt;em&gt;e.g.&lt;/em&gt; &lt;a href=&#34;http://bioconductor.org/packages/devel/bioc/html/systemPipeR.html&#34;&gt;GOCluster_Report&lt;/a&gt;, &lt;a href=&#34;https://bioconductor.org/packages/3.12/bioc/html/fgsea.html&#34;&gt;fgsea&lt;/a&gt;, chipenrich, goseq, GOstats) using KEGG/Reactome or Gene Ontology as functional annotation systems. The chosen FEA methods should include one based on the hypergeometric distribution (over-representation analysis, ORA) and one based on the Gene Set Enrichment Analysis (GSEA) algorithm. Assess the results as follows:
&lt;ul&gt;
&lt;li&gt;Quantify the rank-based similarities of the functional categories among the chosen enrichment methods.&lt;/li&gt;
&lt;li&gt;Determine whether the enrichment results match the biological expectations of the experiment (e.g. are certain biological processes enriched?).&lt;/li&gt;
&lt;li&gt;Optional: visualize the results with one of the pathway or GO graph viewing tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sergushichev A (2016) An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. &lt;a href=&#34;https://www.biorxiv.org/content/10.1101/060012v3&#34;&gt;bioRxiv 060012&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/16199517/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Welch RP, Lee C, Imbriano PM, Patil S, Weymouth TE, Smith RA, Scott LJ, Sartor MA (2014) ChIP-Enrich: gene set enrichment testing for ChIP-seq data. Nucleic Acids Res 42: e105. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/24878920/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 11: R14. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20132535/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Motif Enrichment Analysis (MEA)</title>
      <link>/assignments/projects/07_motif_enrichment/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/07_motif_enrichment/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-projects&#34;&gt;Challenge Projects&lt;/h2&gt;
&lt;h3 id=&#34;1-motif-enrichment&#34;&gt;1. Motif enrichment&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Prioritize/rank peaks by FDR from differential binding analysis&lt;/li&gt;
&lt;li&gt;Parse peak sequences from genome&lt;/li&gt;
&lt;li&gt;Determine which motifs in the JASPAR database (&lt;a href=&#34;http://bioconductor.org/packages/release/bioc/html/MotifDb.html&#34;&gt;MotifDb&lt;/a&gt;) show the highest enrichment in the peak sequences. The motif enrichment tests can be performed with the &lt;a href=&#34;http://bioconductor.org/packages/release/bioc/html/PWMEnrich.html&#34;&gt;PWMEnrich&lt;/a&gt; package. Basic starter code for accomplishing these tasks is provided &lt;a href=&#34;https://gist.github.com/tgirke/df6fe20c2e42e71a7ade04941d4a05e9&#34;&gt;here&lt;/a&gt;. The motif mapping can be performed with &lt;a href=&#34;http://bioconductor.org/packages/3.12/bioc/html/Biostrings.html&#34;&gt;matchPWM&lt;/a&gt; or &lt;a href=&#34;http://bioconductor.org/packages/3.12/bioc/html/motifmatchr.html&#34;&gt;motifmatchr&lt;/a&gt;, and motif identification in databases can be performed with &lt;a href=&#34;https://bioconductor.org/packages/3.12/bioc/html/MotIV.html&#34;&gt;MotIV&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To have distinct challenge project aspects for each of the two students in this project, one could use different peak ranking approaches, e.g. one ranks by FDR of differential binding analysis, and the other by coverage or p-values of peak caller.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-motif-discovery&#34;&gt;2. Motif discovery&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Use peaks discovered in workflow (steps 1-7 above) for motif discovery&lt;/li&gt;
&lt;li&gt;Run discovery with at least two motif discovery tools (MEME-ChIP and BCRANK)&lt;/li&gt;
&lt;li&gt;Identify motifs that are reported by at least two discovery tools&lt;/li&gt;
&lt;li&gt;Identify motifs that are most similar to those reported by Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Optional: compare with known motifs in the JASPAR database&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Frith, Martin C., Yutao Fu, Liqun Yu, Jiang‐fan Chen, Ulla Hansen, and Zhiping Weng. 2004. “Detection of Functional DNA Motifs via Statistical Over‐representation.” Nucleic Acids Research 32 (4): 1372–81. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/14988425/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Machanick P, Bailey TL (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27: 1696–1697. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/21486936/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;McLeay, Robert C, and Timothy L Bailey. 2010. “Motif Enrichment Analysis: A Unified Framework and an Evaluation on ChIP Data.” BMC Bioinformatics 11: 165. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20356413/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tompa, M, N Li, T L Bailey, G M Church, B De Moor, E Eskin, A V Favorov, et al. 2005. “Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites.” Nature Biotechnology 23 (1): 137–44. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/15637633/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33: 831–838. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/26213851/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Drug-target analysis</title>
      <link>/assignments/projects/09_drug_target_analysis/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/09_drug_target_analysis/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-drug-target-analysis-of-proteins-encoded-by-genes-in-peak-regions&#34;&gt;Challenge Project: Drug-target analysis of proteins encoded by genes in peak regions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Identify protein coding genes in peak regions&lt;/li&gt;
&lt;li&gt;Identify corresponding human orthologs&lt;/li&gt;
&lt;li&gt;Perform drug-target annotation analysis, e.g. with &lt;a href=&#34;https://bioconductor.org/packages/release/bioc/html/drugTargetInteractions.html&#34;&gt;drugTargetInteractions&lt;/a&gt; package&lt;/li&gt;
&lt;li&gt;Identify similar drugs with two different structural similarity search algorithms (e.g. 2 fingerprint methods)&lt;/li&gt;
&lt;li&gt;Challenge question: which of the two structural similarity search tools identifies more similar small molecules that have annotated protein targets
in ChEMBL (DrugBank)? Explore options on how to visualize the performance results.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42: 1407–1414. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/12444738/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Genome Summary Graphics</title>
      <link>/assignments/projects/08_genome_graphics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/08_genome_graphics/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-programmable-graphics-for-visualizing-genomic-features&#34;&gt;Challenge Project: Programmable graphics for visualizing genomic features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;This project focuses on the visualization of patterns in NGS experiments (&lt;em&gt;e.g.&lt;/em&gt; consensus motifs in ChIP-Seq peaks) to discover novel features in genomes. The visualization backend should be based on one of the programmable and extendable R/Bioconductor environments such as &lt;a href=&#34;https://ggplot2.tidyverse.org/&#34;&gt;ggplot2&lt;/a&gt; (&lt;a href=&#34;https://plotly.com/ggplot2/&#34;&gt;ggplotly&lt;/a&gt;), &lt;a href=&#34;https://www.bioconductor.org/packages/release/bioc/html/ggbio.html&#34;&gt;ggbio&lt;/a&gt;, &lt;a href=&#34;https://bioconductor.org/packages/release/bioc/html/Gviz.html&#34;&gt;Gviz&lt;/a&gt;, &lt;a href=&#34;https://cran.r-project.org/web/packages/RCircos/index.html&#34;&gt;RCircos&lt;/a&gt;, etc. For instance, this could include:
&lt;ul&gt;
&lt;li&gt;The generation of motif logos (&lt;em&gt;e.g.&lt;/em&gt; for ChIP-Seq peaks) for any number of sequence ranges of interest.&lt;/li&gt;
&lt;li&gt;Integration of the results with functional annotation information (&lt;em&gt;e.g.&lt;/em&gt; protein families from Pfam, exonic regions coding for disordered structures), pathways and/or GO.&lt;/li&gt;
&lt;li&gt;Incorporation of quantitative information such as relative or differential abundance information obtained from the corresponding NGS profiling technology.&lt;/li&gt;
&lt;li&gt;If there is interest, a Shiny App could be included to run the developed R functions interactively from a web browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Hahne F, Ivanek R (2016). “Statistical Genomics: Methods and Protocols.” In Mathé E, Davis S (eds.), chapter Visualizing Genomic Data Using Gviz and Bioconductor, 335–351. Springer New York, New York, NY. ISBN 978-1-4939-3578-9, doi: 10.1007/978-1-4939-3578-9_16. &lt;a href=&#34;https://link.springer.com/protocol/10.1007%2F978-1-4939-3578-9_16&#34;&gt;Springer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Yin T, Cook D, Lawrence M (2012). “ggbio: an R package for extending the grammar of graphics for genomic data.” Genome Biology, 13(8), R77. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/22937822/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zhang H, Meltzer P, Davis S (2013) RCircos: an R package for Circos 2D track plots. BMC Bioinformatics 14: 244–244. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/23937229/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: lncRNAs and other features</title>
      <link>/assignments/projects/10_lncrna_orf_discovery/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/10_lncrna_orf_discovery/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;chip-seq-workflow&#34;&gt;ChIP-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read quality assessment, filtering and trimming&lt;/li&gt;
&lt;li&gt;Align reads to reference genome&lt;/li&gt;
&lt;li&gt;Compute read coverage across genome&lt;/li&gt;
&lt;li&gt;Peak calling with different methods and consensus peak identification&lt;/li&gt;
&lt;li&gt;Annotate peaks&lt;/li&gt;
&lt;li&gt;Differential binding analysis&lt;/li&gt;
&lt;li&gt;Gene set enrichment analysis&lt;/li&gt;
&lt;li&gt;Motif prediction to identify putative TF binding sites&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-functional-enrichment-analysis-fea&#34;&gt;Challenge Project: lncRNAs and other features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run workflow from start to finish (steps 1-8) on ChIP-Seq data set from Kaufmann et al. (2010)&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Parse DNA sequences of identified peak footprints&lt;/li&gt;
&lt;li&gt;Identify 1-2 of the following feature types in the peak sequences:
&lt;ul&gt;
&lt;li&gt;Long non-coding RNAs (lncRNAs; Han et al., 2019; Hu et al., 2017)&lt;/li&gt;
&lt;li&gt;Open reading frames (ORFs)&lt;/li&gt;
&lt;li&gt;miRNAs&lt;/li&gt;
&lt;li&gt;Repeats&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Kaufmann, K, F Wellmer, J M Muiño, T Ferrier, S E Wuest, V Kumar, A Serrano-Mislata, et al. 2010. “Orchestration of Floral Initiation by APETALA1.” Science 328 (5974): 85–89. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/20360106/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Han S, Liang Y, Ma Q, Xu Y, Zhang Y, Du W, Wang C, Li Y (2019) LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief Bioinform 20: 2009–2027. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/30084867/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hu L, Xu Z, Hu B, Lu ZJ (2017) COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res 45: e2. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/27608726/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Project Data Management and Run Instructions</title>
      <link>/assignments/projects/project_data/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/project_data/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;big-data-space-on-hpcc&#34;&gt;Big data space on HPCC&lt;/h2&gt;
&lt;p&gt;All larger data sets of the course projects will be organized in a big data space under
&lt;code&gt;/bigdata/gen242/&amp;lt;user_name&amp;gt;&lt;/code&gt;. Within this space, each student will work in a subdirectory named after their project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/bigdata/gen242/&amp;lt;user_name&amp;gt;/&amp;lt;github_name&amp;gt;_project&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;project-github-repositories&#34;&gt;Project GitHub repositories&lt;/h2&gt;
&lt;p&gt;Students will work on their course projects within GitHub repositories, one for each student.
These project repositories are private and have been shared with each student.
To populate a course project with an initial project workflow, please follow the instructions
given in the following section.&lt;/p&gt;
&lt;h2 id=&#34;generate-workflow-environment-with-real-project-data&#34;&gt;Generate workflow environment with real project data&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Log in to the HPCC cluster and set your working directory to your &lt;code&gt;bigdata&lt;/code&gt; space (&lt;code&gt;/bigdata/gen242/&amp;lt;user_name&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Clone the GitHub repository for your project with &lt;code&gt;git clone ...&lt;/code&gt; (URLs are listed in &lt;code&gt;Course Planning&lt;/code&gt; sheet) and then &lt;code&gt;cd&lt;/code&gt; into this directory. As mentioned above, the project GitHub repos follow this naming convention: &lt;code&gt;&amp;lt;github_name&amp;gt;_project&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Generate the workflow environment for your project on the HPCC cluster with &lt;code&gt;genWorkenvir&lt;/code&gt; from &lt;code&gt;systemPipeRdata&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Next, &lt;code&gt;cd&lt;/code&gt; into the directory of your workflow, delete its default &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;results&lt;/code&gt; directories, and then substitute them with empty directories outside of your project GitHub repos as follows (&lt;code&gt;&amp;lt;workflow&amp;gt;&lt;/code&gt; needs to be replaced with actual workflow name):
&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;mkdir ../../&amp;lt;workflow&amp;gt;_data
mkdir ../../&amp;lt;workflow&amp;gt;_results
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Within your workflow directory create symbolic links pointing to the new directories created in the previous step. For instance, the projects using the RNA-Seq workflow should create the symbolic links for their &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;results&lt;/code&gt; directories like this (&lt;code&gt;&amp;lt;user_name&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;workflow&amp;gt;&lt;/code&gt; need to be replaced with your user name and workflow name):
&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;ln -s /bigdata/gen242/&amp;lt;user_name&amp;gt;/&amp;lt;workflow&amp;gt;_data data
ln -s /bigdata/gen242/&amp;lt;user_name&amp;gt;/&amp;lt;workflow&amp;gt;_results results
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Add the workflow directory to the GitHub repository of your project with &lt;code&gt;git add -A&lt;/code&gt; and then run &lt;code&gt;commit&lt;/code&gt; and &lt;code&gt;push&lt;/code&gt; as outlined in the GitHub instructions of this course &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/github/github/#github-basics-from-command-line&#34;&gt;here&lt;/a&gt;. After this, check whether the workflow directory and its content show up in your project&amp;rsquo;s online repository on GitHub. Very important: make sure that the &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;results&lt;/code&gt; directories are empty at this point. If not, investigate why and fix the problem in the corresponding step above.&lt;/li&gt;
&lt;li&gt;Download the FASTQ files of your project with &lt;code&gt;getSRAfastq&lt;/code&gt; (see below) to the &lt;code&gt;data&lt;/code&gt; directory of your project.&lt;/li&gt;
&lt;li&gt;Generate a proper &lt;code&gt;targets&lt;/code&gt; file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: &lt;code&gt;SampleNames&lt;/code&gt; and &lt;code&gt;Factor&lt;/code&gt;). More details about the structure of targets files are provided &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#structure-of-targets-file&#34;&gt;here&lt;/a&gt;. Ready to use targets files for the RNA-Seq, ChIP-Seq and VAR-Seq projects can be downloaded as tab separated (TSV) files from &lt;a href=&#34;https://github.com/tgirke/GEN242/tree/main/content/en/assignments/Projects/targets_files&#34;&gt;here&lt;/a&gt;. Alternatively, one can download the corresponding Google Sheets with the &lt;code&gt;read_sheet&lt;/code&gt; function from the &lt;code&gt;googlesheets4&lt;/code&gt; package (&lt;a href=&#34;https://bit.ly/2QH19Ry&#34;&gt;RNA-Seq GSheet&lt;/a&gt; and &lt;a href=&#34;https://bit.ly/2QFjTAV&#34;&gt;ChIP-Seq GSheet&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Inspect and adjust the &lt;code&gt;.param&lt;/code&gt; files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.&lt;/li&gt;
&lt;li&gt;Every time you start working on your project you &lt;code&gt;cd&lt;/code&gt; into the directory of the repository and then run &lt;code&gt;git pull&lt;/code&gt; to get the latest changes. When you are done, you commit and push your changes back to GitHub with &lt;code&gt;git commit -am &amp;quot;some edits&amp;quot;; git push&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
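&lt;p&gt;Steps 4 and 5 can be rehearsed in a scratch directory before applying them to the real project space. The following is a minimal sketch under stated assumptions: &lt;code&gt;jdoe_project&lt;/code&gt; and &lt;code&gt;rnaseq&lt;/code&gt; are placeholder names, the temporary directory stands in for &lt;code&gt;/bigdata/gen242/&amp;lt;user_name&amp;gt;&lt;/code&gt;, and the initial &lt;code&gt;mkdir -p&lt;/code&gt; call only mimics the default &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;results&lt;/code&gt; directories of a freshly generated workflow environment.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;# Scratch stand-in for /bigdata/gen242/&amp;lt;user_name&amp;gt;; all names below are placeholders
base=$(mktemp -d)
workflow=rnaseq
repo=&amp;quot;$base/jdoe_project&amp;quot;

# Mimic the default data and results directories of a fresh workflow environment
mkdir -p &amp;quot;$repo/$workflow/data&amp;quot; &amp;quot;$repo/$workflow/results&amp;quot;

cd &amp;quot;$repo/$workflow&amp;quot;
rm -r data results    # step 4: delete the default directories
mkdir &amp;quot;../../${workflow}_data&amp;quot; &amp;quot;../../${workflow}_results&amp;quot;    # replacements outside the repository
ln -s &amp;quot;$base/${workflow}_data&amp;quot; data    # step 5: link them into the workflow directory
ln -s &amp;quot;$base/${workflow}_results&amp;quot; results

ls -ld data results    # both entries should be listed as symbolic links
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the link targets are absolute paths, the links should stay valid regardless of the directory from which the workflow is run. On the cluster, the scratch directory would be replaced by the actual &lt;code&gt;bigdata&lt;/code&gt; path.&lt;/p&gt;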
&lt;h2 id=&#34;download-of-project-data&#34;&gt;Download of project data&lt;/h2&gt;
&lt;p&gt;After logging in to one of the compute nodes via &lt;a href=&#34;https://bit.ly/3MD40DW&#34;&gt;&lt;code&gt;srun&lt;/code&gt;&lt;/a&gt;, open R from within the GitHub repository of your project and then run the following code sections, but only those that apply to your project.&lt;/p&gt;
&lt;h3 id=&#34;fastq-files-from-sra&#34;&gt;FASTQ files from SRA&lt;/h3&gt;
&lt;h4 id=&#34;choose-fastq-data-for-your-project&#34;&gt;Choose FASTQ data for your project&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The FASTQ files for the ChIP-Seq project are from SRA study &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/sra?term=SRP002174&#34;&gt;SRP002174&lt;/a&gt; (&lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/20360106&#34;&gt;Kaufmann et al. 2010&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;sraidv &amp;lt;- paste(&amp;quot;SRR0388&amp;quot;, 45:51, sep=&amp;quot;&amp;quot;) 
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;The FASTQ files for the RNA-Seq project are from SRA study &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/sra?term=SRP010938&#34;&gt;SRP010938&lt;/a&gt; (&lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/24098335&#34;&gt;Howard et al. 2013&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;sraidv &amp;lt;- paste(&amp;quot;SRR4460&amp;quot;, 27:44, sep=&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;The FASTQ files for the VAR-Seq project are from SRA study &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/sra?term=SRP008819&#34;&gt;SRP008819&lt;/a&gt; and &lt;a href=&#34;https://www.ncbi.nlm.nih.gov/sra?term=SRP007172&#34;&gt;SRP007172&lt;/a&gt; (&lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/22106370&#34;&gt;Lu et al 2012&lt;/a&gt;). Work only with one of the two studies by using the corresponding targets file (see above).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;sraidv &amp;lt;- c(paste(&amp;quot;SRR1051&amp;quot;, 389:415, sep=&amp;quot;&amp;quot;), c(&amp;quot;SRR352145&amp;quot;, &amp;quot;SRR279136&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&#34;load-libraries-and-modules&#34;&gt;Load libraries and modules&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(systemPipeR)
moduleload(&amp;quot;sratoolkit/3.0.0&amp;quot;)
system(&amp;quot;vdb-config --prefetch-to-cwd&amp;quot;) # sets download default to current directory
# system(&#39;prefetch --help&#39;) # helps to speed up fastq-dump
# system(&#39;vdb-config -i&#39;) # allows changing the SRA Toolkit configuration; instructions are here: https://bit.ly/3lzfU4P
# system(&#39;fastq-dump --help&#39;) # used below for backwards compatibility
# system(&#39;fasterq-dump --help&#39;) # faster than fastq-dump
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&#34;define-download-function&#34;&gt;Define download function&lt;/h4&gt;
&lt;p&gt;The following function downloads and extracts the FASTQ files for each project from SRA.
Internally, it uses the &lt;code&gt;prefetch&lt;/code&gt; and &lt;code&gt;fastq-dump&lt;/code&gt; utilities of the SRA Toolkit from NCBI.
The faster &lt;code&gt;fasterq-dump&lt;/code&gt; alternative (see comment line below) is not used here for historical reasons. Note,
if you use the SRA Toolkit in your HPCC user account for the first time, then it might ask
you to configure it by running &lt;code&gt;vdb-config --interactive&lt;/code&gt; from the command-line. In the
resulting dialog, one can keep the default settings, and then save and exit. Running
&lt;code&gt;vdb-config --prefetch-to-cwd&lt;/code&gt; prior to any FASTQ file downloads sets the download location
to the current working directory (see above).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;getSRAfastq &amp;lt;- function(sraid, threads=1) {
    system(paste(&amp;quot;prefetch&amp;quot;, sraid)) # makes download faster
    system(paste(&amp;quot;vdb-validate&amp;quot;, sraid)) # checks integrity of the downloaded SRA file
    system(paste(&amp;quot;fastq-dump --split-files --gzip&amp;quot;, sraid)) # gzip option makes it slower but saves storage space
    # system(paste(&amp;quot;fasterq-dump --threads&amp;quot;, threads, &amp;quot;--split-files --progress&amp;quot;, sraid, &amp;quot;--outdir .&amp;quot;)) # faster alternative to fastq-dump; uses the threads argument
    unlink(x=sraid, recursive = TRUE, force = TRUE) # deletes sra download directory
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To prevent the remaining steps from running after a failure, for instance one detected by &lt;code&gt;vdb-validate&lt;/code&gt;, chain the commands with the &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; operator like this: &lt;code&gt;prefetch sraid &amp;amp;&amp;amp; vdb-validate sraid &amp;amp;&amp;amp; fastq-dump sraid&lt;/code&gt;.&lt;/p&gt;
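&lt;p&gt;At the shell level, such a chain could look like the sketch below. The function name &lt;code&gt;fetch_sra&lt;/code&gt; is hypothetical, and the sketch assumes that the SRA Toolkit commands are on the &lt;code&gt;PATH&lt;/code&gt; and return a non-zero exit status on failure. The run IDs are passed as arguments, and processing stops at the first run that fails.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;# Download, validate and extract one SRA run; a failing step skips the remaining ones
fetch_sra() {
    prefetch &amp;quot;$1&amp;quot; &amp;amp;&amp;amp;
        vdb-validate &amp;quot;$1&amp;quot; &amp;amp;&amp;amp;
        fastq-dump --split-files --gzip &amp;quot;$1&amp;quot;
}

# Process the run IDs given as arguments and stop at the first failure
for sraid in &amp;quot;$@&amp;quot;; do
    fetch_sra &amp;quot;$sraid&amp;quot; || { echo &amp;quot;Download failed for $sraid&amp;quot; &amp;gt;&amp;amp;2; break; }
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Saved to a file (here hypothetically named &lt;code&gt;fetch_sra.sh&lt;/code&gt;), the sketch could be run as &lt;code&gt;sh fetch_sra.sh SRR038845 SRR038846&lt;/code&gt;.&lt;/p&gt;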
&lt;h4 id=&#34;run-download&#34;&gt;Run download&lt;/h4&gt;
&lt;p&gt;Note, the following performs the download in serialized mode for the chosen data set and saves the extracted FASTQ files to
the &lt;code&gt;data&lt;/code&gt; directory, which is temporarily set as the working directory.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;mydir &amp;lt;- getwd(); setwd(&amp;quot;data&amp;quot;)
for(i in sraidv) getSRAfastq(sraid=i)
setwd(mydir)
## Check whether all FASTQ files were downloaded
downloaded_files &amp;lt;- list.files(&#39;./data&#39;, pattern=&#39;fastq.gz$&#39;)
all(sraidv %in% gsub(&amp;quot;_.*&amp;quot;, &amp;quot;&amp;quot;, downloaded_files)) # Should be TRUE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, the download can be performed in parallelized mode with &lt;code&gt;BiocParallel&lt;/code&gt;. Please run this version only on a compute node.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;mydir &amp;lt;- getwd(); setwd(&amp;quot;data&amp;quot;)
# library(BiocParallel) # provides bplapply and MulticoreParam
# bplapply(sraidv, getSRAfastq, BPPARAM = MulticoreParam(workers=4))
setwd(mydir)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&#34;avoid-fastq-download&#34;&gt;Avoid FASTQ download&lt;/h4&gt;
&lt;p&gt;To save time, the download of the FASTQ files can be skipped. Instead, generate symlinks in the &lt;code&gt;data&lt;/code&gt; directory of your workflow that point to already downloaded FASTQ files.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;fastq_symlink &amp;lt;- function(workflow) {
    file_paths &amp;lt;- list.files(file.path(&amp;quot;/bigdata/gen242/data&amp;quot;, workflow, &amp;quot;data&amp;quot;), pattern=&#39;fastq.gz$&#39;, full.names=TRUE)
    for(i in seq_along(file_paths)) system(paste0(&amp;quot;ln -s &amp;quot;, file_paths[i], &amp;quot; ./data/&amp;quot;, basename(file_paths[i])))
}
workflow_type &amp;lt;- &#39;fastq_rnaseq&#39; # Choose here the correct workflow: &#39;fastq_rnaseq&#39; or &#39;fastq_varseq&#39;
fastq_symlink(workflow=workflow_type)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;download-reference-genome-and-annotation&#34;&gt;Download reference genome and annotation&lt;/h3&gt;
&lt;p&gt;The following &lt;code&gt;downloadRefs&lt;/code&gt; function downloads the &lt;em&gt;Arabidopsis thaliana&lt;/em&gt; (TAIR10) genome sequence and GFF3 file from the &lt;a href=&#34;https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/&#34;&gt;Ensembl Plants FTP site&lt;/a&gt; (release 59).
It also creates a &lt;code&gt;TxDb&lt;/code&gt; annotation database from the GFF file and downloads a table of functional gene descriptions. Note, the chromosome identifiers need to be
the same in the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;downloadRefs &amp;lt;- function(rerun=FALSE) {
   if(rerun==TRUE) {
        library(Biostrings)
        download.file(&amp;quot;https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz&amp;quot;, &amp;quot;./data/tair10.fasta.gz&amp;quot;)
        R.utils::gunzip(&amp;quot;./data/tair10.fasta.gz&amp;quot;)
        download.file(&amp;quot;https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gff3.gz&amp;quot;, &amp;quot;./data/tair10.gff.gz&amp;quot;)
        R.utils::gunzip(&amp;quot;./data/tair10.gff.gz&amp;quot;)
        txdb &amp;lt;- GenomicFeatures::makeTxDbFromGFF(file = &amp;quot;data/tair10.gff&amp;quot;, format = &amp;quot;gff&amp;quot;, dataSource = &amp;quot;TAIR&amp;quot;, organism = &amp;quot;Arabidopsis thaliana&amp;quot;)
        AnnotationDbi::saveDb(txdb, file=&amp;quot;./data/tair10.sqlite&amp;quot;)
        download.file(&amp;quot;https://cluster.hpcc.ucr.edu/~tgirke/Teaching/GEN242/data/tair10_functional_descriptions&amp;quot;, &amp;quot;./data/tair10_functional_descriptions&amp;quot;)
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After importing/sourcing the above function, execute it as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;downloadRefs(rerun=TRUE) 
&lt;/code&gt;&lt;/pre&gt;
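&lt;p&gt;Since matching chromosome identifiers matter for downstream steps, one may want to verify them after the download. The following base R sketch compares the sequence names in a FASTA file with those in the first column of a GFF file. The function name &lt;code&gt;compare_seqnames&lt;/code&gt; is hypothetical, and the approach reads both files fully into memory, which is assumed to be acceptable for a genome of this size.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## Hypothetical helper: reports sequence identifiers found in only one of the two files
compare_seqnames &amp;lt;- function(fasta, gff) {
    ## FASTA identifiers: first word of each header line
    hdr &amp;lt;- grep(&amp;quot;^&amp;gt;&amp;quot;, readLines(fasta), value = TRUE)
    fa_ids &amp;lt;- sub(&amp;quot;^&amp;gt;([^[:space:]]+).*&amp;quot;, &amp;quot;\\1&amp;quot;, hdr)
    ## GFF identifiers: first tab-separated column of all non-comment lines
    gff_lines &amp;lt;- readLines(gff)
    gff_lines &amp;lt;- gff_lines[nzchar(gff_lines) &amp;amp; !grepl(&amp;quot;^#&amp;quot;, gff_lines)]
    gff_ids &amp;lt;- unique(sub(&amp;quot;\t.*&amp;quot;, &amp;quot;&amp;quot;, gff_lines))
    list(only_in_fasta = setdiff(fa_ids, gff_ids),
         only_in_gff = setdiff(gff_ids, fa_ids))
}
## Usage with the files created by downloadRefs
# compare_seqnames(&amp;quot;./data/tair10.fasta&amp;quot;, &amp;quot;./data/tair10.gff&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If both components of the returned list are empty, the identifiers agree. Any identifier listed under &lt;code&gt;only_in_gff&lt;/code&gt; points to annotations without a matching sequence.&lt;/p&gt;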
&lt;h2 id=&#34;workflow-rmd-file&#34;&gt;Workflow Rmd file&lt;/h2&gt;
&lt;p&gt;To run the actual data analysis workflows, each project can use the &lt;code&gt;Rmd&lt;/code&gt; file obtained from the &lt;code&gt;genWorkenvir(workflow=&#39;...&#39;)&lt;/code&gt; call directly. The RNA-Seq group might want to work
with the &lt;code&gt;sprnaseq.Rmd&lt;/code&gt; file used in the tutorial &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq/#workflow-environment&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;recommendations-for-running-workflows&#34;&gt;Recommendations for running workflows&lt;/h2&gt;
&lt;h3 id=&#34;run-instructions&#34;&gt;Run instructions&lt;/h3&gt;
&lt;p&gt;The following provides recommendations and additional options to consider for
running and modifying workflows. This also includes parallelization settings
for the specific data used by the class projects. Note, additional details can
be found in this and other sections of the workflow introduction tutorial
&lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#loading-workflows-from-an-r-markdown&#34;&gt;here&lt;/a&gt;.
Importantly, the following should be run from within an &lt;a href=&#34;https://bit.ly/3MD40DW&#34;&gt;&lt;code&gt;srun&lt;/code&gt;&lt;/a&gt; session.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(systemPipeR)                                                                                                                                                                
sal &amp;lt;- SPRproject() # when running a WF for first time                                                                                                                                      
sal                                                                                                                                                                                 
sal &amp;lt;- importWF(sal, file_path = &amp;quot;systemPipeRNAseq.Rmd&amp;quot;) # populates sal with WF steps defined in Rmd                                                                                                                      
sal
# sal &amp;lt;- SPRproject(resume=TRUE) # when restarting a WF, skip above steps and resume WF with this command                                                                                                                                               
getRversion() # should be 4.2.2. Note, R version can be changed with `module load ...`                                                                                                                                                     
system(&amp;quot;hostname&amp;quot;) # should return the name of a compute node; if not, close the Nvim-R session, log in to a compute node with srun and then restart the Nvim-R session
# sal &amp;lt;- runWF(sal) # runs WF serialized. Not recommended since this will take much longer than parallel mode introduced below by taking advantage of resource allocation
resources &amp;lt;- list(conffile=&amp;quot;.batchtools.conf.R&amp;quot;,                                                                                                                                    
                  template=&amp;quot;batchtools.slurm.tmpl&amp;quot;,                                                                                                                                 
                  Njobs=18, # chipseq should use here number of fastq files (7 or 8)                                                                                                                                                        
                  walltime=180, ## minutes                                                                                                                                          
                  ntasks=1,                                                                                                                                                         
                  ncpus=4,                                                                                                                                                          
                  memory=4096, ## Mb                                                                                                                                                
                  partition = &amp;quot;gen242&amp;quot;,                                                                                                                                              
                  account = &amp;quot;gen242&amp;quot;
                  )                                                                                                                                                                 
## Note, some users might need to update the provided `batchtools.slurm.tmpl` file in their workflow directory by running the following download command: 
# download.file(&amp;quot;https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/batchtools.slurm.tmpl&amp;quot;, &amp;quot;batchtools.slurm.tmpl&amp;quot;)
## Alternatively, changing above &amp;quot;gen242&amp;quot; to &amp;quot;epyc&amp;quot; under partition will also work.
## For RNA-Seq project use:
sal &amp;lt;- addResources(sal, step = c(&amp;quot;preprocessing&amp;quot;, &amp;quot;trimming&amp;quot;, &amp;quot;hisat2_mapping&amp;quot;), resources = resources) # parallelizes time consuming computations assigned to `step` argument                                                                           
## For VAR-Seq project use this line instead:
# sal &amp;lt;- addResources(sal, step = c(&amp;quot;preprocessing&amp;quot;, &amp;quot;bwa_alignment&amp;quot;), resources = resources)
## For ChIP-Seq project use this line instead:
# sal &amp;lt;- addResources(sal, step = c(&amp;quot;preprocessing&amp;quot;, &amp;quot;bowtie2_alignment&amp;quot;), resources = resources)
## For VAR-Seq project use this line instead:
# sal &amp;lt;- addResources(sal, c(&amp;quot;bwa_alignment&amp;quot;), resources = resources)
sal &amp;lt;- runWF(sal) # runs entire workflow; specific steps can be executed by assigning their corresponding position numbers within the workflow to the `steps` argument (see ?runWF)                                                                                                                                                               
sal &amp;lt;- renderReport(sal) # after workflow has completed render Rmd to HTML report (default name is SPR_Report.html) and view it via web browser which requires symbolic link in your ~/.html folder. 
rmarkdown::render(&amp;quot;systemPipeRNAseq.Rmd&amp;quot;, clean=TRUE, output_format=&amp;quot;BiocStyle::html_document&amp;quot;) # Alternative approach for rendering report from Rmd file instead of sal object
&lt;/code&gt;&lt;/pre&gt;
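&lt;p&gt;The &lt;code&gt;Njobs&lt;/code&gt; value above is tied to the number of samples. Rather than hard coding it, one could derive it from the targets file. The following is a minimal sketch; the helper name &lt;code&gt;count_targets&lt;/code&gt; is hypothetical, and it assumes a tab-delimited targets file with one row per sample in which all header comment lines start with &lt;code&gt;#&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## Hypothetical helper: number of samples (rows) in a targets file
count_targets &amp;lt;- function(targets_file) {
    ## Lines starting with &#39;#&#39; are treated as comments and skipped
    nrow(read.delim(targets_file, comment.char = &amp;quot;#&amp;quot;))
}
## Usage (the file name targetsPE.txt is an assumption and depends on the workflow)
# resources$Njobs &amp;lt;- count_targets(&amp;quot;targetsPE.txt&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;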
&lt;h3 id=&#34;modify-a-workflow&#34;&gt;Modify a workflow&lt;/h3&gt;
&lt;p&gt;If needed one can modify existing workflow steps in a pre-populated &lt;code&gt;SYSargsList&lt;/code&gt; object, and potentially already executed WF, with the &lt;code&gt;replaceStep(sal) &amp;lt;-&lt;/code&gt; replacement function.
The following gives an example where step number 3 in a &lt;code&gt;SYSargsList&lt;/code&gt; (sal) object will be updated with modified or new code. Note, this is a generalized example where the user
needs to insert the code lines and also adjust the values assigned to the arguments: &lt;code&gt;step_name&lt;/code&gt; and &lt;code&gt;dependency&lt;/code&gt;. Additional details on this topic are available in
the corresponding section of &lt;code&gt;systemPipeR&lt;/code&gt;&amp;rsquo;s introductory tutorial &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#how-to-modify-workflows&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;replaceStep(sal, step=3) &amp;lt;- LineWise(                                                                                                                                                        
    code = {                                                                                                                                                                        
        &amp;lt;&amp;lt; my modified code lines &amp;gt;&amp;gt;
        },                                                                                                                                                                          
    step_name = &amp;lt;&amp;lt; &amp;quot;my_step_name&amp;quot; &amp;gt;&amp;gt;,                                                                                                                                                        
    dependency = &amp;lt;&amp;lt; &amp;quot;my_dependency&amp;quot; &amp;gt;&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;step_names&lt;/code&gt; need to be unique, one should avoid using the same
&lt;code&gt;step_name&lt;/code&gt; as before. If the previous name is used, a default name will be
assigned. Rerunning the assignment will then allow one to assign the previous name. This
behavior is enforced for version tracking. Subsequently, one can view and check
the code changes with &lt;code&gt;codeLine()&lt;/code&gt;, and then rerun the corresponding step (here
step 3) as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;codeLine(stepsWF(sal)$my_step_name)
sal &amp;lt;- runWF(sal, steps=3) # assign the result back to sal so that the updated run status is kept
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note, any step in a workflow can only be run in isolation if its expected input exists (see &lt;code&gt;dependency&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id=&#34;adding-steps-to-a-workflow&#34;&gt;Adding steps to a workflow&lt;/h3&gt;
&lt;p&gt;New steps can be added to the Rmd file of a workflow by inserting new R Markdown code chunks that use the usual &lt;code&gt;appendStep&amp;lt;-&lt;/code&gt; syntax and then creating a new
&lt;code&gt;SYSargsList&lt;/code&gt; instance with &lt;code&gt;importWF&lt;/code&gt; that contains the new step(s). To add steps to a pre-populated &lt;code&gt;SYSargsList&lt;/code&gt; object, one can use the &lt;code&gt;after&lt;/code&gt; argument of the &lt;code&gt;appendStep&amp;lt;-&lt;/code&gt;
function. The following example will add a new step after position 3 to the corresponding &lt;code&gt;sal&lt;/code&gt; object. This can be useful if a longer workflow has already been completed and
one only wants to make some refinements without re-running the entire workflow.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;appendStep(sal, after=3) &amp;lt;- &amp;lt;&amp;lt; my_step_code &amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;submit-workflow-from-command-line-to-cluster&#34;&gt;Submit workflow from command-line to cluster&lt;/h3&gt;
&lt;p&gt;In addition to running workflows within interactive R sessions, after logging in to a compute node with &lt;code&gt;srun&lt;/code&gt;, one
can execute them entirely from the command-line by including the relevant workflow run instructions in an R script.
The R script can then be submitted via a Slurm submission script to the cluster. The following gives an example for the
RNA-Seq workflow (ChIP-Seq version requires only minor adjustments). Additional details on this topic are available in
the corresponding section of &lt;code&gt;systemPipeR&lt;/code&gt;&amp;rsquo;s introductory tutorial
&lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systempiper/#run-from-command-line-without-cluster&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;R script: &lt;a href=&#34;https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.R&#34;&gt;wf_run_script.R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Slurm submission script: &lt;a href=&#34;https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/wf_run_script.sh&#34;&gt;wf_run_script.sh&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To test this out, users can generate in their user account on the cluster a workflow environment populated with the toy data
as outlined &lt;a href=&#34;https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq/#workflow-environment&#34;&gt;here&lt;/a&gt;. After
this, it is best to create within the workflow directory a subdirectory, e.g. called &lt;code&gt;cl_sbatch_run&lt;/code&gt;, and then save the above
two files (&lt;code&gt;*.R&lt;/code&gt; and &lt;code&gt;*.sh&lt;/code&gt;) to this subdirectory. Next, the parameters in both files need to be adjusted to match the type of workflow and
the required computing resources. This includes the name of the &lt;code&gt;Rmd&lt;/code&gt; file and scheduler resource settings such as:
&lt;code&gt;partition&lt;/code&gt;, &lt;code&gt;Njobs&lt;/code&gt;, &lt;code&gt;walltime&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;, etc. After all relevant settings have been set correctly, one can
execute the workflow with &lt;code&gt;sbatch&lt;/code&gt; within the &lt;code&gt;cl_sbatch_run&lt;/code&gt; directory as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;sbatch wf_run_script.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note, some users might need to replace in the root directory of their workflow the default &lt;code&gt;batchtools.slurm.tmpl&lt;/code&gt; file with this updated version &lt;a href=&#34;https://raw.githubusercontent.com/tgirke/GEN242/main/static/custom/spWFtemplates/cl_sbatch_run/batchtools.slurm.tmpl&#34;&gt;here&lt;/a&gt;. Alternatively, changing in the &lt;code&gt;wf_run_script.R&lt;/code&gt; &amp;ldquo;gen242&amp;rdquo; to &amp;ldquo;epyc&amp;rdquo; under the &lt;code&gt;partition&lt;/code&gt; argument will also work. After the submission to the cluster, one should check the status and progress of the job with &lt;code&gt;squeue -u &amp;lt;username&amp;gt;&lt;/code&gt; as well as
by monitoring the content of the &lt;code&gt;slurm-&amp;lt;jobid&amp;gt;.out&lt;/code&gt; file generated by the scheduler in the same directory. This file
contains most of the &lt;code&gt;STDOUT&lt;/code&gt; and &lt;code&gt;STDERR&lt;/code&gt; generated by a cluster job. Once everything is working on the toy data, users can run the workflow on the real data the same way.&lt;/p&gt;
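&lt;p&gt;For monitoring from within R, the most recent scheduler log can be inspected with a few lines of base R. The following is only a sketch; the helper name &lt;code&gt;tail_slurm_log&lt;/code&gt; is hypothetical, and it assumes that the log files follow the &lt;code&gt;slurm-&amp;lt;jobid&amp;gt;.out&lt;/code&gt; naming pattern mentioned above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## Hypothetical helper: returns the last n lines of the newest slurm-&amp;lt;jobid&amp;gt;.out file
tail_slurm_log &amp;lt;- function(dir = &amp;quot;.&amp;quot;, n = 20) {
    logs &amp;lt;- list.files(dir, pattern = &amp;quot;^slurm-[0-9]+\\.out$&amp;quot;, full.names = TRUE)
    if (length(logs) == 0) return(character(0)) # no log written yet
    newest &amp;lt;- logs[which.max(file.mtime(logs))] # pick by modification time
    utils::tail(readLines(newest, warn = FALSE), n)
}
## Usage from within the cl_sbatch_run directory
# writeLines(tail_slurm_log(n = 30))
&lt;/code&gt;&lt;/pre&gt;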
&lt;p&gt;Detailed step-by-step instructions for running the workflows from the command-line are provided in this &lt;a href=&#34;https://raw.githubusercontent.com/tgirke/GEN242/main/content/en/assignments/Projects/helper_code/wf_run_from_cl.R&#34;&gt;wf_run_from_cl.R&lt;/a&gt; script.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: Compare Performance of Variant Callers</title>
      <link>/assignments/projects/11_varseq/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/11_varseq/</guid>
      <description>
        
        
        &lt;p&gt;&lt;br&gt;&lt;/br&gt;&lt;/p&gt;
&lt;h2 id=&#34;var-seq-workflow&#34;&gt;VAR-Seq Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Read preprocessing: filtering, quality trimming&lt;/li&gt;
&lt;li&gt;Alignments&lt;/li&gt;
&lt;li&gt;Alignment statistics&lt;/li&gt;
&lt;li&gt;Variant calling: focus of challenge project&lt;/li&gt;
&lt;li&gt;Variant filtering&lt;/li&gt;
&lt;li&gt;Variant annotation&lt;/li&gt;
&lt;li&gt;Combine results from many samples&lt;/li&gt;
&lt;li&gt;Summary statistics of samples&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;challenge-project-performance-comparisons-of-variant-callers&#34;&gt;Challenge Project: Performance Comparisons of Variant Callers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Run the workflow from start to finish (steps 1-8) on the VAR-Seq data set from
Lu &lt;em&gt;et al&lt;/em&gt; (2012).&lt;/li&gt;
&lt;li&gt;Challenge project tasks
&lt;ul&gt;
&lt;li&gt;Compare the performance of at least 2 variant callers, e.g. GATK, BCFtools, Octopus and DeepVariant. Include in your comparisons the following analysis/visualization steps (Barbitoff et al 2022; Cooke et al 2021; Li, 2011; Poplin et al 2018).
&lt;ol&gt;
&lt;li&gt;Report unique and common variants identified by tested variant callers.&lt;/li&gt;
&lt;li&gt;Compare the results from (1) with the variants identified by Lu &lt;em&gt;et al&lt;/em&gt; (2012).&lt;/li&gt;
&lt;li&gt;Plot results from 1.-2. as Venn diagrams or similar (&lt;em&gt;e.g.&lt;/em&gt; UpSet plots).&lt;/li&gt;
&lt;li&gt;If there is enough time and interest, plot the performance of the variant callers in the form of ROC plots and calculate AUC values. As pseudo ground truth, one can either use the published variants or the union of the variants identified by all methods.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
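&lt;p&gt;The set comparisons in tasks 1-2 and a simple precision/recall calculation for task 4 can be sketched in base R as follows. This is an illustration under assumptions rather than a prescribed method: the function names are hypothetical, each caller is assumed to be represented by a character vector of variant keys (e.g. &lt;code&gt;chr:pos_ref/alt&lt;/code&gt;), and the example variants are made up.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## Hypothetical helper: common and caller-specific variants from a named list of key vectors
compare_callers &amp;lt;- function(calls) {
    all_vars &amp;lt;- unique(unlist(calls))
    ## Logical membership matrix: rows = variants, columns = callers
    m &amp;lt;- do.call(cbind, lapply(calls, function(x) all_vars %in% x))
    rownames(m) &amp;lt;- all_vars
    n_callers &amp;lt;- rowSums(m) # number of callers reporting each variant
    list(common = all_vars[n_callers == ncol(m)],
         unique = lapply(setNames(names(calls), names(calls)),
                         function(n) all_vars[m[, n] &amp;amp; n_callers == 1]),
         membership = m)
}
## Hypothetical helper: precision and recall against a (pseudo) ground truth set
precision_recall &amp;lt;- function(called, truth) {
    tp &amp;lt;- length(intersect(called, truth))
    c(precision = tp / length(called), recall = tp / length(truth))
}
## Made-up example data
calls &amp;lt;- list(gatk = c(&amp;quot;1:100_A/T&amp;quot;, &amp;quot;1:200_G/C&amp;quot;, &amp;quot;2:50_T/G&amp;quot;),
              bcftools = c(&amp;quot;1:100_A/T&amp;quot;, &amp;quot;2:50_T/G&amp;quot;, &amp;quot;3:10_C/A&amp;quot;))
res &amp;lt;- compare_callers(calls)
res$common
res$unique
precision_recall(calls$gatk, truth = c(&amp;quot;1:100_A/T&amp;quot;, &amp;quot;1:200_G/C&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;membership&lt;/code&gt; matrix has one column per caller and may serve as a starting point for Venn or UpSet style plots. For ROC curves, the variant calls would additionally need a quality score per variant, which this sketch does not cover.&lt;/p&gt;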
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV (2022) Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 23: 155. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/35193511/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Cooke DP, Wedge DC, Lunter G (2021) A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol 39: 885–892. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/33782612/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491–498. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/21478889/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/21903627/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Lu P, Han X, Qi J, Yang J, Wijeratne AJ, Li T, Ma H (2012) Analysis of Arabidopsis genome-wide variations before and after meiosis and meiotic recombination by resequencing Landsberg erecta and all four products of a single meiosis. Genome Res 22: 508–518. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/22106370/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, et al (2018) A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36: 983–987. &lt;a href=&#34;https://pubmed.ncbi.nlm.nih.gov/30247488/&#34;&gt;PubMed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Assignments: </title>
      <link>/assignments/projects/helper_code/aligners/star_test/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/assignments/projects/helper_code/aligners/star_test/</guid>
      <description>
        
        
        &lt;h4 id=&#34;star&#34;&gt;STAR&lt;/h4&gt;
&lt;h3 id=&#34;read-mapping-with-star&#34;&gt;Read mapping with &lt;code&gt;STAR&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;library(systemPipeR)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;# sal_test &amp;lt;- SPRproject(logs.dir= &amp;quot;.SPRproject_test&amp;quot;) # use this line when .SPRproject_test doesn&#39;t exist yet
sal_test &amp;lt;- SPRproject(overwrite = TRUE, logs.dir= &amp;quot;.SPRproject_test&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Creating directory:  /home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/data 
## Creating directory:  /home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/results 
## Creating directory &#39;/home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/.SPRproject_test&#39;
## Creating file &#39;/home/tgirke/tmp/GEN242/content/en/assignments/Projects/helper_code/aligners/.SPRproject_test/SYSargsList.yml&#39;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;appendStep(sal_test) &amp;lt;- LineWise(code = {
                library(systemPipeR)
                }, step_name = &amp;quot;load_SPR&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;read-preprocessing&#34;&gt;Read preprocessing&lt;/h2&gt;
&lt;h3 id=&#34;preprocessing-with-preprocessreads-function&#34;&gt;Preprocessing with &lt;code&gt;preprocessReads&lt;/code&gt; function&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;appendStep(sal_test) &amp;lt;- SYSargsList(
    step_name = &amp;quot;preprocessing&amp;quot;,
    targets = &amp;quot;targetsPE.txt&amp;quot;, dir = TRUE,
    wf_file = &amp;quot;preprocessReads/preprocessReads-pe.cwl&amp;quot;,
    input_file = &amp;quot;preprocessReads/preprocessReads-pe.yml&amp;quot;,
    dir_path = system.file(&amp;quot;extdata/cwl&amp;quot;, package = &amp;quot;systemPipeR&amp;quot;),
    inputvars = c(
        FileName1 = &amp;quot;_FASTQ_PATH1_&amp;quot;,
        FileName2 = &amp;quot;_FASTQ_PATH2_&amp;quot;,
        SampleName = &amp;quot;_SampleName_&amp;quot;
    ),
    dependency = c(&amp;quot;load_SPR&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;alignments-with-star&#34;&gt;Alignments with &lt;code&gt;STAR&lt;/code&gt;&lt;/h2&gt;
&lt;h3 id=&#34;star-indexing&#34;&gt;&lt;code&gt;STAR&lt;/code&gt; Indexing&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;appendStep(sal_test) &amp;lt;- SYSargsList(
    step_name = &amp;quot;star_index&amp;quot;, 
    dir = FALSE, 
    targets=NULL, 
    wf_file = &amp;quot;star/star-index.cwl&amp;quot;, 
    input_file=&amp;quot;star/star-index.yml&amp;quot;,
    dir_path=&amp;quot;param/cwl&amp;quot;, 
    dependency = &amp;quot;load_SPR&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;star-mapping&#34;&gt;&lt;code&gt;STAR&lt;/code&gt; mapping&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;appendStep(sal_test) &amp;lt;- SYSargsList(
    step_name = &amp;quot;star_mapping&amp;quot;,
    dir = TRUE, 
    targets =&amp;quot;preprocessing&amp;quot;, 
    wf_file = &amp;quot;star-mapping-pe.cwl&amp;quot;,
    input_file = &amp;quot;star-mapping-pe.yml&amp;quot;,
    dir_path = &amp;quot;param/star_test&amp;quot;,
    inputvars = c(preprocessReads_1 = &amp;quot;_FASTQ_PATH1_&amp;quot;, preprocessReads_2 = &amp;quot;_FASTQ_PATH2_&amp;quot;, 
                  SampleName = &amp;quot;_SampleName_&amp;quot;),
    rm_targets_col = c(&amp;quot;FileName1&amp;quot;, &amp;quot;FileName2&amp;quot;), 
    dependency = c(&amp;quot;preprocessing&amp;quot;, &amp;quot;star_index&amp;quot;)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## Return command-line calls for STAR
cmdlist(sal_test, step=&amp;quot;star_mapping&amp;quot;, targets=1)

## BAM outpaths required for read counting below
outpaths &amp;lt;- getColumn(sal_test, step = &amp;quot;star_mapping&amp;quot;, &amp;quot;outfiles&amp;quot;, column = &amp;quot;Aligned_toTranscriptome_out_bam&amp;quot;)
file.exists(outpaths) # Will not return TRUE until STAR has completed successfully
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## To run sal_test stepwise, make sure you have constructed your 
## sal_test object step-by-step starting from an empty sal_test
## as shown above under chunk: intialize_sal_for_testing 
sal_test &amp;lt;- runWF(sal_test, steps=c(1)) # increment step number one by one just for checking
sal_test
outpaths &amp;lt;- getColumn(sal_test, step = &amp;quot;star_mapping&amp;quot;, &amp;quot;outfiles&amp;quot;, column = &amp;quot;Aligned_toTranscriptome_out_bam&amp;quot;)
outpaths
file.exists(outpaths) # Will not return TRUE until STAR has completed successfully
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&#34;language-r&#34;&gt;## The following can be used for setting things up during initial testing
starPE &amp;lt;- loadWorkflow(targets = &amp;quot;targetsPE.txt&amp;quot;, wf_file = &amp;quot;star-mapping-pe.cwl&amp;quot;, 
                       input_file = &amp;quot;star-mapping-pe.yml&amp;quot;, dir_path = &amp;quot;./param/star_test&amp;quot;)
starPE &amp;lt;- renderWF(starPE, inputvars = c(FileName1 = &amp;quot;_FASTQ_PATH1_&amp;quot;, FileName2 = &amp;quot;_FASTQ_PATH2_&amp;quot;, 
                                         SampleName = &amp;quot;_SampleName_&amp;quot;))
cmdlist(starPE)
runCommandline(starPE, make_bam = FALSE)
&lt;/code&gt;&lt;/pre&gt;

      </description>
    </item>
    
  </channel>
</rss>
