HW06 — RNA-Seq Analysis

Source code downloads: [ .qmd ]

Overview

This homework covers read counting strategies and differential expression analysis (DEG) from the RNA-Seq Workflow tutorial. It has three parts:

Part A — Compare unstranded and strand-specific read counting with featureCounts
Part B — Count reads over different genomic feature types with featureCounts
Part C — DEG analysis with edgeR and Venn diagram visualization

All tasks build on the RNA-Seq workflow environment generated with genWorkenvir_gh from this class-specific GitHub repos: sprwf-rnaseq-gen242-2026. Complete the workflow with runWF before starting this homework. The sorted BAM files generated in the HISAT2 mapping step and (optionally) the sal object it produces are required as input throughout.

Required packages

library(Rsubread)        # featureCounts
library(GenomicFeatures) # loadDb, exonsBy, intronsByTranscript, fiveUTRsByTranscript
library(systemPipeR)     # run_edgeR, filterDEGs, overLapper, vennPlot
library(edgeR)

Required input objects

Run the following once inside your rnaseq workflow directory to prepare shared inputs used across all tasks. The TxDb object (tair10.sqlite) is created from the GFF annotation file during the RNA-Seq workflow and saved to disk — load it with loadDb(). All annotation ranges for featureCounts are derived from this TxDb.

## Load TxDb created during the RNA-Seq workflow
txdb <- loadDb("./data/tair10.sqlite")

## Exons grouped by gene — used as annotation for featureCounts
## in Parts A and B (gene/exon feature types)
eByg <- exonsBy(txdb, by = "gene")

## Helper function: convert GRangesList/GRanges to featureCounts annotation
## data frame (required format: GeneID, Chr, Start, End, Strand).
## Use this for every featureCounts call in Parts A and B.
grl_to_df <- function(grl) {
    gr <- if (is(grl, "GRangesList")) unlist(grl) else grl
    data.frame(GeneID = names(gr),
               Chr    = as.character(seqnames(gr)),
               Start  = start(gr),
               End    = end(gr),
               Strand = as.character(strand(gr)),
               stringsAsFactors = FALSE)
}

## Paths to sorted BAM files produced by HISAT2 mapping step
bam_files <- getColumn(sal, step = "hisat2_mapping",
                       "outfiles", column = "samtools_sort_bam")
## Alternative if sal object is not available:
# bam_files <- list.files("results/hisat2_mapping/",
#                         pattern = "sorted.bam$", full.names = TRUE)
# names(bam_files) <- read.delim("targetsPE.txt", comment.char="#")$SampleName

A. Unstranded and Strand-Specific Read Counting

Background

featureCounts from the Rsubread package counts reads overlapping genomic features. When a GRangesList object such as eByg (exons grouped by gene) is passed via annot.ext, featureCounts treats each list element as one meta-feature (gene). This is equivalent to the exonsBy(txdb, by="gene") approach used throughout the RNA-Seq workflow and is more flexible than relying on a GTF file. Set isGTFAnnotationFile = FALSE when passing a range object. The strandSpecific argument controls strand awareness:

Mode	`strandSpecific`	Counts reads on
Unstranded	`0`	Both strands
Sense (positive) strand	`1`	Same strand as feature
Antisense (negative) strand	`2`	Opposite strand to feature

For paired-end RNA-Seq data always set isPairedEnd = TRUE. The count matrix is accessed from the returned list via $counts. The nthreads argument controls parallelization — adjust to the number of available cores.

Task A1 — Generate three count tables

Using bam_files as input and eByg as the annotation, generate three count tables — one per strand mode. Pass grl_to_df(eByg) to annot.ext to convert the GRangesList to the data frame format required by featureCounts.

1. Unstranded (strandSpecific = 0)

fc_unstranded <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(eByg),
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 0,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
unstranded <- fc_unstranded$counts
unstranded[1:4, ]

2. Sense (positive) strand (strandSpecific = 1)

fc_stranded_pos <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(eByg),
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_pos <- fc_stranded_pos$counts
stranded_pos[1:4, ]

3. Antisense (negative) strand (strandSpecific = 2)

fc_stranded_neg <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(eByg),
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 2,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_neg <- fc_stranded_neg$counts
stranded_neg[1:4, ]

Store the count matrices in variables named exactly unstranded, stranded_pos, and stranded_neg.

Task A2 — Compare strand-specific counts to unstranded

Test whether the two strand-specific count tables sum to approximately the same values as the unstranded count table. Calculate for each strand-specific result the mean mapping frequency as a percentage of the unstranded counts:

unstranded[unstranded == 0] <- NA
perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)
perc_pos
perc_neg

Store results in variables named exactly perc_pos and perc_neg.

Interpretation: if both values are close to 50%, the RNA-Seq library is very likely unstranded. If one value dominates (e.g. ~100% sense, ~0% antisense), the library is strand-specific. For paired-end data expect very similar but not necessarily identical results between the sum of stranded counts and the unstranded total.

Task A3 — Biological interpretation of strand-specific counting

Include comment text in your script addressing the following four points:

When should strand-specific counting be used instead of unstranded counting?
How does strand-specific counting help resolve reads from overlapping genes encoded on opposite strands?
What biological information can antisense strand counting reveal?
How can comparing stranded and unstranded results determine whether an RNA-Seq experiment used a strand-specific library preparation protocol?

B. Read Counting for Different Feature Types

Background

By using different range extraction functions on txdb, reads can be counted over different genomic features (genes, exons, introns, UTRs). All annotation ranges are derived from the same TxDb object loaded from tair10.sqlite, keeping everything consistent with the RNA-Seq workflow. featureCounts accepts GRanges and GRangesList objects directly via annot.ext — set isGTFAnnotationFile = FALSE for all range-based inputs.

Task B — Count reads over five feature types

Compute sense strand (strandSpecific = 1) count tables for the following five feature types. Use isPairedEnd = TRUE for gene and exon features. For introns and UTRs use isPairedEnd = FALSE (single-end mode since these are sub-gene features).

1. Genes — exons grouped by gene (eByg from exonsBy)

## eByg already available from the input setup above
fc_genes <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(eByg),
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
gene_ranges_countDF <- fc_genes$counts
gene_ranges_countDF[1:4, ]

2. Exons — individual exon level using exons(txdb)

exon_ranges <- exons(txdb)
names(exon_ranges) <- names(unlist(eByg))
fc_exons <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(exon_ranges),
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
exon_ranges_countDF <- fc_exons$counts
exon_ranges_countDF[1:4, ]

3. Exons by genes — eByg is the direct equivalent of exonsBy(txdb, by="gene"); reuse it here for clarity

## eByg groups exons by gene — identical to the exonsBy approach
exonByg_ranges_countDF <- gene_ranges_countDF   # reuses result from Task B1
exonByg_ranges_countDF[1:4, ]

4. Introns by transcripts — convert with grl_to_df()

intron_ranges <- intronsByTranscript(txdb, use.names = TRUE)
fc_introns <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(intron_ranges),
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
intron_ranges_countDF <- fc_introns$counts
intron_ranges_countDF[1:4, ]

5. 5’-UTRs by transcripts — convert with grl_to_df()

fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fc_fiveUTR <- featureCounts(
    files                  = bam_files,
    annot.ext              = grl_to_df(fiveUTR_ranges),
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
fiveUTR_ranges_countDF <- fc_fiveUTR$counts
fiveUTR_ranges_countDF[1:4, ]

Use the exact variable names shown above: gene_ranges_countDF, exon_ranges_countDF, exonByg_ranges_countDF, intron_ranges_countDF, fiveUTR_ranges_countDF.

C. DEG Analysis with edgeR

Background

run_edgeR() from systemPipeR wraps the standard edgeR DEG workflow. Here you will run it on two count tables (unstranded vs sense strand) and compare the resulting DEG sets visually using 4-way Venn diagrams.

Task C — DEG analysis and Venn diagram comparison

1. Run edgeR on both count tables

cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]],
                format = "matrix", delim = "-")
## Alternative if sal is not available:
# cmp <- readComp(file = "targetsPE.txt", format = "matrix", delim = "-")

## DEG analysis with unstranded count table
colnames(unstranded) <- gsub("\\..*", "", colnames(unstranded))
edgeDF_unstranded <- run_edgeR(
    countDF     = unstranded,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)

## DEG analysis with sense strand count table
colnames(stranded_pos) <- gsub("\\..*", "", colnames(stranded_pos))
edgeDF_pos <- run_edgeR(
    countDF     = stranded_pos,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)

2. Generate and save 4-way Venn diagrams

## (1) Venn diagram — unstranded count table
DEG_list_unstranded <- filterDEGs(
    degDF  = edgeDF_unstranded,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_unstranded$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_unstranded$Down[6:9], type = "vennsets")
pdf("results/DEGcount_unstranded.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()

## (2) Venn diagram — sense strand count table
DEG_list_pos <- filterDEGs(
    degDF  = edgeDF_pos,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_pos$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_pos$Down[6:9], type = "vennsets")
pdf("results/DEGcount_pos.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()

Both PDF files must be committed to the repository at:

results/DEGcount_unstranded.pdf
results/DEGcount_pos.pdf

D. Bonus Challenge — edgeR in Native Mode with Time Course Design

Bonus: +1 extra point

This section is optional. Completing it earns 1 bonus point on top of the 5-point maximum, for a total possible score of 6/5. It is intentionally more challenging and is meant for students who want a deeper understanding of linear model design in RNA-Seq analysis.

Background

The run_edgeR() wrapper used in Part C is convenient but hides the underlying statistical machinery. Internally, it builds a simple pairwise design matrix and fits a generalized linear model (GLM). For the Howard et al. (2013) data set used throughout this workflow, a pairwise design is not the optimal approach — the experiment is a two-factor time course that deserves a more informative model.

The experimental design of Howard et al. (2013) consists of 18 samples:

Factor	Meaning	Time points	Replicates
`M`	Mock-inoculated (susceptible)	1h, 2h, 12h	A, B
`A`	Pseudomonas-infected (resistant)	1h, 2h, 12h	A, B

The sample names in targetsPE.txt encode this as: M1A, M1B, M2A, M2B, M12A, M12B (mock) and A1A, A1B, A2A, A2B, A12A, A12B (infected). The two factors are treatment (mock vs infected) and time (1h, 2h, 12h).

An appropriate model for this design captures:

The main effect of time
The main effect of treatment (mock vs infected)
The interaction between time and treatment — i.e. whether the treatment effect changes over time, which is the biologically interesting question

Task D — Run edgeR natively with an interaction model

Using the unstranded count table from Part A as input, implement the full native edgeR workflow with a ~ treatment + time + treatment:time interaction design.

Step 1 — Prepare sample metadata

library(edgeR)

## Load targets to get sample information
targets <- read.delim("targetsPE.txt", comment.char = "#")

## Extract treatment and time factors from the Factor column
## Factor levels: M1, M2, M12 (mock) and A1, A2, A12 (infected)
targets$treatment <- ifelse(grepl("^M", targets$Factor), "mock", "infected")
targets$time      <- gsub("^[MAV]", "", targets$Factor)   # extracts "1", "2", "12"
targets$time      <- factor(targets$time, levels = c("1", "6", "12"))
targets$treatment <- factor(targets$treatment, levels = c("mock", "infected"))

## Confirm the design
targets[, c("SampleName", "Factor", "treatment", "time")]

Step 2 — Create DGEList and filter low-count genes

## Match count table columns to sample order in targets
countDF <- unstranded[, targets$SampleName]

## Create DGEList object
dge <- DGEList(counts = countDF, group = targets$Factor)

## Filter genes with very low counts across all samples
## Keep genes with at least 1 count per million (CPM) in at least 2 samples
keep    <- rowSums(cpm(dge) >= 1) >= 2
dge     <- dge[keep, , keep.lib.sizes = FALSE]
message(sum(keep), " genes retained after low-count filtering")

## Normalize library sizes with TMM
dge <- calcNormFactors(dge, method = "TMM")

Step 3 — Build the interaction design matrix

## Interaction model: ~ treatment + time + treatment:time
## This tests whether the treatment effect differs across time points
design_tc <- model.matrix(~ treatment + time + treatment:time,
                          data = targets)
colnames(design_tc)
## Should show: Intercept, treatmentinfected, time2, time12,
##              treatmentinfected:time2, treatmentinfected:time12

Step 4 — Estimate dispersion and fit the GLM

## Estimate common, trended and tagwise dispersions
dge <- estimateDisp(dge, design_tc, robust = TRUE)
plotBCV(dge)   # optional: inspect biological coefficient of variation

## Fit the negative binomial GLM
fit_tc <- glmQLFit(dge, design_tc, robust = TRUE)

Step 5 — Test for genes responding differently across time in infected vs mock

Test the interaction terms — these identify genes where the treatment effect changes significantly over time (the core time course question):

## Test interaction coefficients:
## treatmentinfected:time2  and  treatmentinfected:time12
interaction_cols <- grep("treatmentinfected:time", colnames(design_tc))
qlf_interaction  <- glmQLFTest(fit_tc, coef = interaction_cols)

## Extract top DEGs
topTags(qlf_interaction, n = 20)

## Summary of significant genes at FDR < 0.05
## note: because the toy gene set contains only 127 genes,
## an unreasonably large p.value is chosen here to obtain at least some genes 
summary(decideTests(qlf_interaction, p.value = 0.3))

## Save full results table
tc_results <- as.data.frame(topTags(qlf_interaction, n = Inf))
write.csv(tc_results, "results/edgeR_timecourse_interaction.csv")

Step 6 — Compare with pairwise result from Part C

Briefly compare (in comment text) the number of DEGs identified by the interaction model versus the pairwise approach from Part C. Consider:

Which approach identified more DEGs? Is more genes better?
What methods would you use to determine which approach enriches for genes that align with the biology of the RNA-Seq experiment.
Which design better captures the biology of a time course experiment and why?
What biological question does each design answer?

Store your interaction results in a variable named exactly tc_results and save the CSV to results/edgeR_timecourse_interaction.csv.

Homework Submission

Submit one R script named HW6.R and the two Venn diagram PDFs to your private GitHub homework repository at the following paths:

Homework/HW6/HW6.R
results/DEGcount_unstranded.pdf
results/DEGcount_pos.pdf
results/edgeR_timecourse_interaction.csv   # bonus only

Requirements for full credit

Use featureCounts() from Rsubread for all read counting (Parts A and B)
Define all variables with the exact names specified throughout
Include comment text for all four points of Task A3
Commit both Venn diagram PDF files alongside HW6.R
Script must be well structured following the recommended layout below

Bonus requirement (+1 point)

To earn the bonus point, additionally include in HW6.R:

The full native edgeR workflow for the interaction model (~ treatment + time + treatment:time)
Variable tc_results containing the full DEG table from topTags(..., n = Inf)
File results/edgeR_timecourse_interaction.csv committed to the repo
Comment text comparing the interaction model result with the pairwise result from Part C

Recommended script structure

## ─────────────────────────────────────────────────────────────
## HW6: RNA-Seq Analysis
## GEN242, Spring 2026
## Author: <your name>
## ─────────────────────────────────────────────────────────────

library(Rsubread); library(GenomicFeatures)
library(systemPipeR); library(edgeR)

## Input
txdb      <- loadDb("./data/tair10.sqlite")
eByg      <- exonsBy(txdb, by = "gene")
bam_files <- getColumn(sal, step = "hisat2_mapping",
                       "outfiles", column = "samtools_sort_bam")

## Helper: convert GRangesList/GRanges to featureCounts annotation data frame
grl_to_df <- function(grl) {
    gr <- unlist(grl)
    data.frame(GeneID = names(gr), Chr = as.character(seqnames(gr)),
               Start  = start(gr), End = end(gr),
               Strand = as.character(strand(gr)), stringsAsFactors = FALSE)
}

## ── Part A: Strand-specific read counting ─────────────────────

## Task A1 — unstranded (strandSpecific = 0)
fc_unstranded <- featureCounts(files = bam_files, annot.ext = grl_to_df(eByg),
                               isGTFAnnotationFile = FALSE,
                               useMetaFeatures = TRUE, strandSpecific = 0,
                               isPairedEnd = TRUE, nthreads = 4)
unstranded <- fc_unstranded$counts

## Task A1 — sense strand (strandSpecific = 1)
fc_stranded_pos <- featureCounts( <fill in arguments, strandSpecific = 1> )
stranded_pos <- fc_stranded_pos$counts

## Task A1 — antisense strand (strandSpecific = 2)
fc_stranded_neg <- featureCounts( <fill in arguments, strandSpecific = 2> )
stranded_neg <- fc_stranded_neg$counts

## Task A2 — percentage comparison
perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)

## Task A3 — biological interpretation (replace ... with your answers)
## 1. Strand-specific counting should be used when the library is stranded because ...
## 2. For overlapping genes on opposite strands, strand-specific counting allows ...
## 3. Antisense strand counting can reveal biological events such as ...
## 4. Comparing stranded and unstranded results determines strandedness because ...

## ── Part B: Feature type counting ────────────────────────────

## B1: Genes
gene_ranges_countDF    <- featureCounts( <fill in, annot.ext = grl_to_df(eByg)> )$counts

## B2: Exons (individual)
exon_ranges            <- exons(txdb)
exon_ranges_countDF    <- featureCounts( <fill in, annot.ext = grl_to_df(exon_ranges)> )$counts

## B3: Exons by genes
exonByg_ranges_countDF <- gene_ranges_countDF

## B4: Introns
intron_ranges          <- intronsByTranscript(txdb, use.names = TRUE)
intron_ranges_countDF  <- featureCounts( <fill in, annot.ext = grl_to_df(intron_ranges)> )$counts

## B5: 5'-UTRs
fiveUTR_ranges         <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTR_ranges_countDF <- featureCounts( <fill in, annot.ext = grl_to_df(fiveUTR_ranges)> )$counts

## ── Part C: DEG analysis ──────────────────────────────────────

cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]], format = "matrix", delim = "-")

edgeDF_unstranded <- run_edgeR(countDF = unstranded,
                               targets = targetsWF(sal)[["hisat2_mapping"]],
                               cmp = cmp[[1]], independent = FALSE, mdsplot = "")
edgeDF_pos        <- run_edgeR(countDF = stranded_pos,
                               targets = targetsWF(sal)[["hisat2_mapping"]],
                               cmp = cmp[[1]], independent = FALSE, mdsplot = "")

## Venn diagrams → results/DEGcount_unstranded.pdf and results/DEGcount_pos.pdf
## <add filterDEGs, overLapper, vennPlot calls here>

Due Date

This homework is due Thursday, May 21st at 6:00 PM.

Homework Solutions

See here.