HW06 — RNA-Seq Analysis

Source code downloads: [ .qmd ]

Overview

This homework covers read counting strategies and differential expression analysis (DEG) from the RNA-Seq Workflow tutorial. It has three parts:

  • Part A — Compare unstranded and strand-specific read counting with featureCounts
  • Part B — Count reads over different genomic feature types with featureCounts
  • Part C — DEG analysis with edgeR and Venn diagram visualization

All tasks build on the RNA-Seq workflow environment generated with systemPipeRdata::genWorkenvir(workflow = "rnaseq"). Complete the workflow through the HISAT2 mapping step before starting this homework — the sorted BAM files and sal object it produces are required as input throughout.

Required packages

library(Rsubread)        # featureCounts
library(GenomicFeatures) # loadDb, exonsBy, intronsByTranscript, fiveUTRsByTranscript
library(systemPipeR)     # run_edgeR, filterDEGs, overLapper, vennPlot
library(edgeR)

Required input objects

Run the following once inside your rnaseq workflow directory to prepare shared inputs used across all tasks. The TxDb object (tair10.sqlite) is created from the GFF annotation file during the RNA-Seq workflow and saved to disk — load it with loadDb(). All annotation ranges for featureCounts are derived from this TxDb.

## Load TxDb created during the RNA-Seq workflow
txdb <- loadDb("./data/tair10.sqlite")

## Exons grouped by gene — used as annotation for featureCounts
## in Parts A and B (gene/exon feature types)
eByg <- exonsBy(txdb, by = "gene")

## Paths to sorted BAM files produced by HISAT2 mapping step
bam_files <- getColumn(sal, step = "hisat2_mapping",
                       "outfiles", column = "samtools_sort_bam")
## Alternative if sal object is not available:
# bam_files <- list.files("results/hisat2_mapping/",
#                         pattern = "sorted.bam$", full.names = TRUE)

A. Unstranded and Strand-Specific Read Counting

Background

featureCounts from the Rsubread package counts reads overlapping genomic features. When a GRangesList object such as eByg (exons grouped by gene) is passed via annot.ext, featureCounts treats each list element as one meta-feature (gene). This is equivalent to the exonsBy(txdb, by="gene") approach used throughout the RNA-Seq workflow and is more flexible than relying on a GTF file. Set isGTFAnnotationFile = FALSE when passing a range object. The strandSpecific argument controls strand awareness:

Mode strandSpecific Counts reads on
Unstranded 0 Both strands
Sense (positive) strand 1 Same strand as feature
Antisense (negative) strand 2 Opposite strand to feature

For paired-end RNA-Seq data always set isPairedEnd = TRUE. The count matrix is accessed from the returned list via $counts. The nthreads argument controls parallelization — adjust to the number of available cores.

Task A1 — Generate three count tables

Using bam_files as input and eByg as the annotation, generate three count tables — one per strand mode. Since eByg is already a GRangesList of exons grouped by gene, set useMetaFeatures = TRUE and isGTFAnnotationFile = FALSE.

1. Unstranded (strandSpecific = 0)

fc_unstranded <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 0,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
unstranded <- fc_unstranded$counts
unstranded[1:4, ]

2. Sense (positive) strand (strandSpecific = 1)

fc_stranded_pos <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_pos <- fc_stranded_pos$counts
stranded_pos[1:4, ]

3. Antisense (negative) strand (strandSpecific = 2)

fc_stranded_neg <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 2,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_neg <- fc_stranded_neg$counts
stranded_neg[1:4, ]

Store the count matrices in variables named exactly unstranded, stranded_pos, and stranded_neg.

Task A2 — Compare strand-specific counts to unstranded

Test whether the two strand-specific count tables sum to approximately the same values as the unstranded count table. Calculate for each strand-specific result the mean mapping frequency as a percentage of the unstranded counts:

perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)
perc_pos
perc_neg

Store results in variables named exactly perc_pos and perc_neg.

Interpretation: if both values are close to 50%, the RNA-Seq library is very likely unstranded. If one value dominates (e.g. ~100% sense, ~0% antisense), the library is strand-specific. For paired-end data expect very similar but not necessarily identical results between the sum of stranded counts and the unstranded total.

Task A3 — Biological interpretation of strand-specific counting

Include comment text in your script addressing the following four points:

  1. When should strand-specific counting be used instead of unstranded counting?
  2. How does strand-specific counting help resolve reads from overlapping genes encoded on opposite strands?
  3. What biological information can antisense strand counting reveal?
  4. How can comparing stranded and unstranded results determine whether an RNA-Seq experiment used a strand-specific library preparation protocol?

B. Read Counting for Different Feature Types

Background

By using different range extraction functions on txdb, reads can be counted over different genomic features (genes, exons, introns, UTRs). All annotation ranges are derived from the same TxDb object loaded from tair10.sqlite, keeping everything consistent with the RNA-Seq workflow. featureCounts accepts GRanges and GRangesList objects directly via annot.ext — set isGTFAnnotationFile = FALSE for all range-based inputs.

Task B — Count reads over five feature types

Compute sense strand (strandSpecific = 1) count tables for the following five feature types. Use isPairedEnd = TRUE for gene and exon features. For introns and UTRs use isPairedEnd = FALSE (single-end mode since these are sub-gene features).

1. Genes — exons grouped by gene (eByg from exonsBy)

## eByg already available from the input setup above
fc_genes <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
gene_ranges_countDF <- fc_genes$counts
gene_ranges_countDF[1:4, ]

2. Exons — individual exon level using exons(txdb)

exon_ranges <- exons(txdb)
fc_exons <- featureCounts(
    files                  = bam_files,
    annot.ext              = exon_ranges,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
exon_ranges_countDF <- fc_exons$counts
exon_ranges_countDF[1:4, ]

3. Exons by geneseByg is the direct equivalent of exonsBy(txdb, by="gene"); reuse it here for clarity

## eByg groups exons by gene — identical to the exonsBy approach
exonByg_ranges_countDF <- gene_ranges_countDF   # reuses result from Task B1
exonByg_ranges_countDF[1:4, ]

4. Introns by transcriptsintronsByTranscript(txdb) returns a GRangesList that is passed directly to featureCounts

intron_ranges <- intronsByTranscript(txdb, use.names = TRUE)
fc_introns <- featureCounts(
    files                  = bam_files,
    annot.ext              = intron_ranges,
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = FALSE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
intron_ranges_countDF <- fc_introns$counts
intron_ranges_countDF[1:4, ]

5. 5’-UTRs by transcriptsfiveUTRsByTranscript(txdb) passed directly

fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fc_fiveUTR <- featureCounts(
    files                  = bam_files,
    annot.ext              = fiveUTR_ranges,
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = FALSE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
fiveUTR_ranges_countDF <- fc_fiveUTR$counts
fiveUTR_ranges_countDF[1:4, ]

Use the exact variable names shown above: gene_ranges_countDF, exon_ranges_countDF, exonByg_ranges_countDF, intron_ranges_countDF, fiveUTR_ranges_countDF.


C. DEG Analysis with edgeR

Background

run_edgeR() from systemPipeR wraps the standard edgeR DEG workflow. Here you will run it on two count tables (unstranded vs sense strand) and compare the resulting DEG sets visually using 4-way Venn diagrams.

Task C — DEG analysis and Venn diagram comparison

1. Run edgeR on both count tables

cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]],
                format = "matrix", delim = "-")
## Alternative if sal is not available:
# cmp <- readComp(file = "targetsPE.txt", format = "matrix", delim = "-")

## DEG analysis with unstranded count table
edgeDF_unstranded <- run_edgeR(
    countDF     = unstranded,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)

## DEG analysis with sense strand count table
edgeDF_pos <- run_edgeR(
    countDF     = stranded_pos,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)

2. Generate and save 4-way Venn diagrams

## (1) Venn diagram — unstranded count table
DEG_list_unstranded <- filterDEGs(
    degDF  = edgeDF_unstranded,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_unstranded$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_unstranded$Down[6:9], type = "vennsets")
pdf("results/DEGcount_unstranded.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()

## (2) Venn diagram — sense strand count table
DEG_list_pos <- filterDEGs(
    degDF  = edgeDF_pos,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_pos$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_pos$Down[6:9], type = "vennsets")
pdf("results/DEGcount_pos.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()

Both PDF files must be committed to the repository at:

  • results/DEGcount_unstranded.pdf
  • results/DEGcount_pos.pdf

D. Bonus Challenge — edgeR in Native Mode with Time Course Design

Bonus: +1 extra point

This section is optional. Completing it earns 1 bonus point on top of the 5-point maximum, for a total possible score of 6/5. It is intentionally more challenging and is meant for students who want a deeper understanding of linear model design in RNA-Seq analysis.

Background

The run_edgeR() wrapper used in Part C is convenient but hides the underlying statistical machinery. Internally, it builds a simple pairwise design matrix and fits a generalized linear model (GLM). For the Howard et al. (2013) data set used throughout this workflow, a pairwise design is not the optimal approach — the experiment is a two-factor time course that deserves a more informative model.

The experimental design of Howard et al. (2013) consists of 18 samples:

Factor Meaning Time points Replicates
M Mock-inoculated (susceptible) 1h, 2h, 12h A, B
A Pseudomonas-infected (resistant) 1h, 2h, 12h A, B

The sample names in targetsPE.txt encode this as: M1A, M1B, M2A, M2B, M12A, M12B (mock) and A1A, A1B, A2A, A2B, A12A, A12B (infected). The two factors are treatment (mock vs infected) and time (1h, 2h, 12h).

An appropriate model for this design captures:

  • The main effect of time
  • The main effect of treatment (mock vs infected)
  • The interaction between time and treatment — i.e. whether the treatment effect changes over time, which is the biologically interesting question

Task D — Run edgeR natively with an interaction model

Using the unstranded count table from Part A as input, implement the full native edgeR workflow with a ~ treatment + time + treatment:time interaction design.

Step 1 — Prepare sample metadata

library(edgeR)

## Load targets to get sample information
targets <- read.delim("targetsPE.txt", comment.char = "#")

## Extract treatment and time factors from the Factor column
## Factor levels: M1, M2, M12 (mock) and A1, A2, A12 (infected)
targets$treatment <- ifelse(grepl("^M", targets$Factor), "mock", "infected")
targets$time      <- gsub("^[MA]", "", targets$Factor)   # extracts "1", "2", "12"
targets$time      <- factor(targets$time, levels = c("1", "2", "12"))
targets$treatment <- factor(targets$treatment, levels = c("mock", "infected"))

## Confirm the design
targets[, c("SampleName", "Factor", "treatment", "time")]

Step 2 — Create DGEList and filter low-count genes

## Match count table columns to sample order in targets
countDF <- unstranded[, targets$SampleName]

## Create DGEList object
dge <- DGEList(counts = countDF, group = targets$Factor)

## Filter genes with very low counts across all samples
## Keep genes with at least 1 count per million (CPM) in at least 2 samples
keep    <- rowSums(cpm(dge) >= 1) >= 2
dge     <- dge[keep, , keep.lib.sizes = FALSE]
message(sum(keep), " genes retained after low-count filtering")

## Normalize library sizes with TMM
dge <- calcNormFactors(dge, method = "TMM")

Step 3 — Build the interaction design matrix

## Interaction model: ~ treatment + time + treatment:time
## This tests whether the treatment effect differs across time points
design_tc <- model.matrix(~ treatment + time + treatment:time,
                          data = targets)
colnames(design_tc)
## Should show: Intercept, treatmentinfected, time2, time12,
##              treatmentinfected:time2, treatmentinfected:time12

Step 4 — Estimate dispersion and fit the GLM

## Estimate common, trended and tagwise dispersions
dge <- estimateDisp(dge, design_tc, robust = TRUE)
plotBCV(dge)   # optional: inspect biological coefficient of variation

## Fit the negative binomial GLM
fit_tc <- glmQLFit(dge, design_tc, robust = TRUE)

Step 5 — Test for genes responding differently across time in infected vs mock

Test the interaction terms — these identify genes where the treatment effect changes significantly over time (the core time course question):

## Test interaction coefficients:
## treatmentinfected:time2  and  treatmentinfected:time12
interaction_cols <- grep("treatmentinfected:time", colnames(design_tc))
qlf_interaction  <- glmQLFTest(fit_tc, coef = interaction_cols)

## Extract top DEGs
topTags(qlf_interaction, n = 20)

## Summary of significant genes at FDR < 0.05
summary(decideTests(qlf_interaction, p.value = 0.05))

## Save full results table
tc_results <- as.data.frame(topTags(qlf_interaction, n = Inf))
write.csv(tc_results, "results/edgeR_timecourse_interaction.csv")

Step 6 — Compare with pairwise result from Part C

Briefly compare (in comment text) the number of DEGs identified by the interaction model versus the pairwise approach from Part C. Consider:

  • Which approach identified more DEGs?
  • Which design better captures the biology of a time course experiment and why?
  • What biological question does each design answer?

Store your interaction results in a variable named exactly tc_results and save the CSV to results/edgeR_timecourse_interaction.csv.


Homework Submission

Submit one R script named HW6.R and the two Venn diagram PDFs to your private GitHub homework repository at the following paths:

Homework/HW6/HW6.R
results/DEGcount_unstranded.pdf
results/DEGcount_pos.pdf
results/edgeR_timecourse_interaction.csv   # bonus only

Requirements for full credit

  1. Use featureCounts() from Rsubread for all read counting (Parts A and B)
  2. Define all variables with the exact names specified throughout
  3. Include comment text for all four points of Task A3
  4. Commit both Venn diagram PDF files alongside HW6.R
  5. Script must be well structured following the recommended layout below

Bonus requirement (+1 point)

To earn the bonus point, additionally include in HW6.R:

  • The full native edgeR workflow for the interaction model (~ treatment + time + treatment:time)
  • Variable tc_results containing the full DEG table from topTags(..., n = Inf)
  • File results/edgeR_timecourse_interaction.csv committed to the repo
  • Comment text comparing the interaction model result with the pairwise result from Part C

Due Date

This homework is due Tuesday, May 20th at 6:00 PM.

Homework Solutions

To be posted after the due date.

Back to top