HW06 — RNA-Seq Analysis
Source code downloads: [ .qmd ]
Overview
This homework covers read counting strategies and differential expression analysis (DEG) from the RNA-Seq Workflow tutorial. It has three parts:
- Part A — Compare unstranded and strand-specific read counting with
featureCounts - Part B — Count reads over different genomic feature types with
featureCounts - Part C — DEG analysis with
edgeRand Venn diagram visualization
All tasks build on the RNA-Seq workflow environment generated with systemPipeRdata::genWorkenvir(workflow = "rnaseq"). Complete the workflow through the HISAT2 mapping step before starting this homework — the sorted BAM files and sal object it produces are required as input throughout.
Required packages
Required input objects
Run the following once inside your rnaseq workflow directory to prepare shared inputs used across all tasks. The TxDb object (tair10.sqlite) is created from the GFF annotation file during the RNA-Seq workflow and saved to disk — load it with loadDb(). All annotation ranges for featureCounts are derived from this TxDb.
## Load TxDb created during the RNA-Seq workflow
txdb <- loadDb("./data/tair10.sqlite")
## Exons grouped by gene — used as annotation for featureCounts
## in Parts A and B (gene/exon feature types)
eByg <- exonsBy(txdb, by = "gene")
## Paths to sorted BAM files produced by HISAT2 mapping step
bam_files <- getColumn(sal, step = "hisat2_mapping",
"outfiles", column = "samtools_sort_bam")
## Alternative if sal object is not available:
# bam_files <- list.files("results/hisat2_mapping/",
# pattern = "sorted.bam$", full.names = TRUE)A. Unstranded and Strand-Specific Read Counting
Background
featureCounts from the Rsubread package counts reads overlapping genomic features. When a GRangesList object such as eByg (exons grouped by gene) is passed via annot.ext, featureCounts treats each list element as one meta-feature (gene). This is equivalent to the exonsBy(txdb, by="gene") approach used throughout the RNA-Seq workflow and is more flexible than relying on a GTF file. Set isGTFAnnotationFile = FALSE when passing a range object. The strandSpecific argument controls strand awareness:
| Mode | strandSpecific |
Counts reads on |
|---|---|---|
| Unstranded | 0 |
Both strands |
| Sense (positive) strand | 1 |
Same strand as feature |
| Antisense (negative) strand | 2 |
Opposite strand to feature |
For paired-end RNA-Seq data always set isPairedEnd = TRUE. The count matrix is accessed from the returned list via $counts. The nthreads argument controls parallelization — adjust to the number of available cores.
Task A1 — Generate three count tables
Using bam_files as input and eByg as the annotation, generate three count tables — one per strand mode. Since eByg is already a GRangesList of exons grouped by gene, set useMetaFeatures = TRUE and isGTFAnnotationFile = FALSE.
1. Unstranded (strandSpecific = 0)
2. Sense (positive) strand (strandSpecific = 1)
3. Antisense (negative) strand (strandSpecific = 2)
Store the count matrices in variables named exactly unstranded, stranded_pos, and stranded_neg.
Task A2 — Compare strand-specific counts to unstranded
Test whether the two strand-specific count tables sum to approximately the same values as the unstranded count table. Calculate for each strand-specific result the mean mapping frequency as a percentage of the unstranded counts:
Store results in variables named exactly perc_pos and perc_neg.
Interpretation: if both values are close to 50%, the RNA-Seq library is very likely unstranded. If one value dominates (e.g. ~100% sense, ~0% antisense), the library is strand-specific. For paired-end data expect very similar but not necessarily identical results between the sum of stranded counts and the unstranded total.
Task A3 — Biological interpretation of strand-specific counting
Include comment text in your script addressing the following four points:
- When should strand-specific counting be used instead of unstranded counting?
- How does strand-specific counting help resolve reads from overlapping genes encoded on opposite strands?
- What biological information can antisense strand counting reveal?
- How can comparing stranded and unstranded results determine whether an RNA-Seq experiment used a strand-specific library preparation protocol?
B. Read Counting for Different Feature Types
Background
By using different range extraction functions on txdb, reads can be counted over different genomic features (genes, exons, introns, UTRs). All annotation ranges are derived from the same TxDb object loaded from tair10.sqlite, keeping everything consistent with the RNA-Seq workflow. featureCounts accepts GRanges and GRangesList objects directly via annot.ext — set isGTFAnnotationFile = FALSE for all range-based inputs.
Task B — Count reads over five feature types
Compute sense strand (strandSpecific = 1) count tables for the following five feature types. Use isPairedEnd = TRUE for gene and exon features. For introns and UTRs use isPairedEnd = FALSE (single-end mode since these are sub-gene features).
1. Genes — exons grouped by gene (eByg from exonsBy)
## eByg already available from the input setup above
fc_genes <- featureCounts(
files = bam_files,
annot.ext = eByg,
isGTFAnnotationFile = FALSE,
useMetaFeatures = TRUE,
strandSpecific = 1,
isPairedEnd = TRUE,
countMultiMappingReads = FALSE,
nthreads = 4
)
gene_ranges_countDF <- fc_genes$counts
gene_ranges_countDF[1:4, ]2. Exons — individual exon level using exons(txdb)
exon_ranges <- exons(txdb)
fc_exons <- featureCounts(
files = bam_files,
annot.ext = exon_ranges,
isGTFAnnotationFile = FALSE,
useMetaFeatures = FALSE,
strandSpecific = 1,
isPairedEnd = TRUE,
countMultiMappingReads = FALSE,
nthreads = 4
)
exon_ranges_countDF <- fc_exons$counts
exon_ranges_countDF[1:4, ]3. Exons by genes — eByg is the direct equivalent of exonsBy(txdb, by="gene"); reuse it here for clarity
4. Introns by transcripts — intronsByTranscript(txdb) returns a GRangesList that is passed directly to featureCounts
intron_ranges <- intronsByTranscript(txdb, use.names = TRUE)
fc_introns <- featureCounts(
files = bam_files,
annot.ext = intron_ranges,
isGTFAnnotationFile = FALSE,
strandSpecific = 1,
isPairedEnd = FALSE,
countMultiMappingReads = FALSE,
nthreads = 4
)
intron_ranges_countDF <- fc_introns$counts
intron_ranges_countDF[1:4, ]5. 5’-UTRs by transcripts — fiveUTRsByTranscript(txdb) passed directly
fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fc_fiveUTR <- featureCounts(
files = bam_files,
annot.ext = fiveUTR_ranges,
isGTFAnnotationFile = FALSE,
strandSpecific = 1,
isPairedEnd = FALSE,
countMultiMappingReads = FALSE,
nthreads = 4
)
fiveUTR_ranges_countDF <- fc_fiveUTR$counts
fiveUTR_ranges_countDF[1:4, ]Use the exact variable names shown above: gene_ranges_countDF, exon_ranges_countDF, exonByg_ranges_countDF, intron_ranges_countDF, fiveUTR_ranges_countDF.
C. DEG Analysis with edgeR
Background
run_edgeR() from systemPipeR wraps the standard edgeR DEG workflow. Here you will run it on two count tables (unstranded vs sense strand) and compare the resulting DEG sets visually using 4-way Venn diagrams.
Task C — DEG analysis and Venn diagram comparison
1. Run edgeR on both count tables
cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]],
format = "matrix", delim = "-")
## Alternative if sal is not available:
# cmp <- readComp(file = "targetsPE.txt", format = "matrix", delim = "-")
## DEG analysis with unstranded count table
edgeDF_unstranded <- run_edgeR(
countDF = unstranded,
targets = targetsWF(sal)[["hisat2_mapping"]],
cmp = cmp[[1]],
independent = FALSE,
mdsplot = ""
)
## DEG analysis with sense strand count table
edgeDF_pos <- run_edgeR(
countDF = stranded_pos,
targets = targetsWF(sal)[["hisat2_mapping"]],
cmp = cmp[[1]],
independent = FALSE,
mdsplot = ""
)2. Generate and save 4-way Venn diagrams
## (1) Venn diagram — unstranded count table
DEG_list_unstranded <- filterDEGs(
degDF = edgeDF_unstranded,
filter = c(Fold = 2, FDR = 40),
plot = FALSE
)
vennsetup <- overLapper(DEG_list_unstranded$Up[6:9], type = "vennsets")
vennsetdown <- overLapper(DEG_list_unstranded$Down[6:9], type = "vennsets")
pdf("results/DEGcount_unstranded.pdf")
vennPlot(list(vennsetup, vennsetdown),
mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()
## (2) Venn diagram — sense strand count table
DEG_list_pos <- filterDEGs(
degDF = edgeDF_pos,
filter = c(Fold = 2, FDR = 40),
plot = FALSE
)
vennsetup <- overLapper(DEG_list_pos$Up[6:9], type = "vennsets")
vennsetdown <- overLapper(DEG_list_pos$Down[6:9], type = "vennsets")
pdf("results/DEGcount_pos.pdf")
vennPlot(list(vennsetup, vennsetdown),
mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()Both PDF files must be committed to the repository at:
results/DEGcount_unstranded.pdfresults/DEGcount_pos.pdf
D. Bonus Challenge — edgeR in Native Mode with Time Course Design
This section is optional. Completing it earns 1 bonus point on top of the 5-point maximum, for a total possible score of 6/5. It is intentionally more challenging and is meant for students who want a deeper understanding of linear model design in RNA-Seq analysis.
Background
The run_edgeR() wrapper used in Part C is convenient but hides the underlying statistical machinery. Internally, it builds a simple pairwise design matrix and fits a generalized linear model (GLM). For the Howard et al. (2013) data set used throughout this workflow, a pairwise design is not the optimal approach — the experiment is a two-factor time course that deserves a more informative model.
The experimental design of Howard et al. (2013) consists of 18 samples:
| Factor | Meaning | Time points | Replicates |
|---|---|---|---|
M |
Mock-inoculated (susceptible) | 1h, 2h, 12h | A, B |
A |
Pseudomonas-infected (resistant) | 1h, 2h, 12h | A, B |
The sample names in targetsPE.txt encode this as: M1A, M1B, M2A, M2B, M12A, M12B (mock) and A1A, A1B, A2A, A2B, A12A, A12B (infected). The two factors are treatment (mock vs infected) and time (1h, 2h, 12h).
An appropriate model for this design captures:
- The main effect of time
- The main effect of treatment (mock vs infected)
- The interaction between time and treatment — i.e. whether the treatment effect changes over time, which is the biologically interesting question
Task D — Run edgeR natively with an interaction model
Using the unstranded count table from Part A as input, implement the full native edgeR workflow with a ~ treatment + time + treatment:time interaction design.
Step 1 — Prepare sample metadata
library(edgeR)
## Load targets to get sample information
targets <- read.delim("targetsPE.txt", comment.char = "#")
## Extract treatment and time factors from the Factor column
## Factor levels: M1, M2, M12 (mock) and A1, A2, A12 (infected)
targets$treatment <- ifelse(grepl("^M", targets$Factor), "mock", "infected")
targets$time <- gsub("^[MA]", "", targets$Factor) # extracts "1", "2", "12"
targets$time <- factor(targets$time, levels = c("1", "2", "12"))
targets$treatment <- factor(targets$treatment, levels = c("mock", "infected"))
## Confirm the design
targets[, c("SampleName", "Factor", "treatment", "time")]Step 2 — Create DGEList and filter low-count genes
## Match count table columns to sample order in targets
countDF <- unstranded[, targets$SampleName]
## Create DGEList object
dge <- DGEList(counts = countDF, group = targets$Factor)
## Filter genes with very low counts across all samples
## Keep genes with at least 1 count per million (CPM) in at least 2 samples
keep <- rowSums(cpm(dge) >= 1) >= 2
dge <- dge[keep, , keep.lib.sizes = FALSE]
message(sum(keep), " genes retained after low-count filtering")
## Normalize library sizes with TMM
dge <- calcNormFactors(dge, method = "TMM")Step 3 — Build the interaction design matrix
## Interaction model: ~ treatment + time + treatment:time
## This tests whether the treatment effect differs across time points
design_tc <- model.matrix(~ treatment + time + treatment:time,
data = targets)
colnames(design_tc)
## Should show: Intercept, treatmentinfected, time2, time12,
## treatmentinfected:time2, treatmentinfected:time12Step 4 — Estimate dispersion and fit the GLM
Step 5 — Test for genes responding differently across time in infected vs mock
Test the interaction terms — these identify genes where the treatment effect changes significantly over time (the core time course question):
## Test interaction coefficients:
## treatmentinfected:time2 and treatmentinfected:time12
interaction_cols <- grep("treatmentinfected:time", colnames(design_tc))
qlf_interaction <- glmQLFTest(fit_tc, coef = interaction_cols)
## Extract top DEGs
topTags(qlf_interaction, n = 20)
## Summary of significant genes at FDR < 0.05
summary(decideTests(qlf_interaction, p.value = 0.05))
## Save full results table
tc_results <- as.data.frame(topTags(qlf_interaction, n = Inf))
write.csv(tc_results, "results/edgeR_timecourse_interaction.csv")Step 6 — Compare with pairwise result from Part C
Briefly compare (in comment text) the number of DEGs identified by the interaction model versus the pairwise approach from Part C. Consider:
- Which approach identified more DEGs?
- Which design better captures the biology of a time course experiment and why?
- What biological question does each design answer?
Store your interaction results in a variable named exactly tc_results and save the CSV to results/edgeR_timecourse_interaction.csv.
Homework Submission
Submit one R script named HW6.R and the two Venn diagram PDFs to your private GitHub homework repository at the following paths:
Homework/HW6/HW6.R
results/DEGcount_unstranded.pdf
results/DEGcount_pos.pdf
results/edgeR_timecourse_interaction.csv # bonus only
Requirements for full credit
- Use
featureCounts()fromRsubreadfor all read counting (Parts A and B) - Define all variables with the exact names specified throughout
- Include comment text for all four points of Task A3
- Commit both Venn diagram PDF files alongside
HW6.R - Script must be well structured following the recommended layout below
Bonus requirement (+1 point)
To earn the bonus point, additionally include in HW6.R:
- The full native edgeR workflow for the interaction model (
~ treatment + time + treatment:time) - Variable
tc_resultscontaining the full DEG table fromtopTags(..., n = Inf) - File
results/edgeR_timecourse_interaction.csvcommitted to the repo - Comment text comparing the interaction model result with the pairwise result from Part C
Recommended script structure
## ─────────────────────────────────────────────────────────────
## HW6: RNA-Seq Analysis
## GEN242, Spring 2026
## Author: <your name>
## ─────────────────────────────────────────────────────────────
library(Rsubread); library(GenomicFeatures)
library(systemPipeR); library(edgeR)
## Input
txdb <- loadDb("./data/tair10.sqlite")
eByg <- exonsBy(txdb, by = "gene")
bam_files <- getColumn(sal, step = "hisat2_mapping",
"outfiles", column = "samtools_sort_bam")
## ── Part A: Strand-specific read counting ─────────────────────
## Task A1 — unstranded (strandSpecific = 0)
fc_unstranded <- featureCounts(files = bam_files, annot.ext = eByg,
isGTFAnnotationFile = FALSE,
useMetaFeatures = TRUE, strandSpecific = 0,
isPairedEnd = TRUE, nthreads = 4)
unstranded <- fc_unstranded$counts
## Task A1 — sense strand (strandSpecific = 1)
fc_stranded_pos <- featureCounts( <fill in arguments, strandSpecific = 1> )
stranded_pos <- fc_stranded_pos$counts
## Task A1 — antisense strand (strandSpecific = 2)
fc_stranded_neg <- featureCounts( <fill in arguments, strandSpecific = 2> )
stranded_neg <- fc_stranded_neg$counts
## Task A2 — percentage comparison
perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)
## Task A3 — biological interpretation (replace ... with your answers)
## 1. Strand-specific counting should be used when the library is stranded because ...
## 2. For overlapping genes on opposite strands, strand-specific counting allows ...
## 3. Antisense strand counting can reveal biological events such as ...
## 4. Comparing stranded and unstranded results determines strandedness because ...
## ── Part B: Feature type counting ────────────────────────────
## B1: Genes (eByg = exonsBy(txdb, by="gene"))
gene_ranges_countDF <- featureCounts( <fill in, annot.ext = eByg> )$counts
## B2: Exons (individual)
exon_ranges <- exons(txdb)
exon_ranges_countDF <- featureCounts( <fill in, annot.ext = exon_ranges> )$counts
## B3: Exons by genes
exonByg_ranges_countDF <- gene_ranges_countDF
## B4: Introns
intron_ranges <- intronsByTranscript(txdb, use.names = TRUE)
intron_ranges_countDF <- featureCounts( <fill in, annot.ext = intron_ranges> )$counts
## B5: 5'-UTRs
fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTR_ranges_countDF <- featureCounts( <fill in, annot.ext = fiveUTR_ranges> )$counts
## ── Part C: DEG analysis ──────────────────────────────────────
cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]], format = "matrix", delim = "-")
edgeDF_unstranded <- run_edgeR(countDF = unstranded,
targets = targetsWF(sal)[["hisat2_mapping"]],
cmp = cmp[[1]], independent = FALSE, mdsplot = "")
edgeDF_pos <- run_edgeR(countDF = stranded_pos,
targets = targetsWF(sal)[["hisat2_mapping"]],
cmp = cmp[[1]], independent = FALSE, mdsplot = "")
## Venn diagrams → results/DEGcount_unstranded.pdf and results/DEGcount_pos.pdf
## <add filterDEGs, overLapper, vennPlot calls here>Due Date
This homework is due Tuesday, May 20th at 6:00 PM.
Homework Solutions
To be posted after the due date.