GEN242: Data Analysis in Genome Biology
2026-04-23
Topics covered in this tutorial:
IRanges and GRangesTxDb databasesNote
HW06 tasks are linked at the end of this tutorial.
Full assignment: HW06
Basic string handling (grep, gsub, substring) plus the full range of numeric data analysis tools.
| Package | Purpose |
|---|---|
| Biostrings | General sequence analysis — the core package |
| ShortRead | Pipeline for short read / FASTQ data |
| IRanges | Low-level range data infrastructure |
| GenomicRanges | High-level genomic range data |
| GenomicFeatures | Transcript-centric annotation management |
| GenomicAlignments | Short genomic alignments |
| Rsamtools | Interface to samtools, bcftools, tabix |
| BSgenome | Genome annotation data |
| biomaRt | Interface to BioMart annotations |
| rtracklayer | Annotation imports, interface to genome browsers |
myseq[grep("ATG", myseq)] # sequences containing "ATG"
pos1 <- regexpr("AT", myseq) # position of FIRST match per sequence
as.numeric(pos1)
attributes(pos1)$match.length
pos2 <- gregexpr("AT", myseq) # positions of ALL matches per sequence
as.numeric(pos2[[1]]) # match positions in first sequence
attributes(pos2[[1]])$match.length # match lengths in first sequenceThe Biostrings package provides dedicated containers for biological sequences.
XString — single sequence| Class | Use |
|---|---|
DNAString |
DNA sequence |
RNAString |
RNA sequence |
AAString |
Amino acid sequence |
BString |
Any character string |
XStringSet — many sequences in one object| Class | Use |
|---|---|
DNAStringSet |
set of DNA sequences |
RNAStringSet |
set of RNA sequences |
AAStringSet |
set of amino acid sequences |
dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC"))
names(dset) <- c("seq1", "seq2", "seq3")
dset[1:2]
width(dset) # length of each sequence
d <- dset[[1]] # [[ returns single entry as XString
dset2 <- c(dset, dset) # concatenate two XStringSet objects
dsetchar <- as.character(dset) # convert to named character vector
dsetone <- unlist(dset) # collapse all sequences into one DNAString
DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # subset by positionsQualityScaleXStringSet — sequences with quality scoresBiostrings supports pattern matching with mismatches — not possible with base R regex.
vcountPattern("ATGGCT", myseq1, max.mismatch=1)[1:20] # count per sequence
myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) # match positions (MIndex)
Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # retrieve matches for seq 1
# Retrieve match sequences across all entries
sapply(names(myseq1), function(x)
as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]]))))[1:4]seqLogoggseqlogo (more customization options)ggseqlogo supports various alphabets including amino acid sequences. See the manual for details.
FASTQ is the standard format for raw NGS reads. Each read occupies 4 lines:
@SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA
+SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A| Line | Content |
|---|---|
| 1 | @ + read ID and description |
| 2 | DNA sequence |
| 3 | + (optionally repeated ID) |
| 4 | Quality scores as ASCII characters (Phred encoding) |
Phred scores are integers 0–50 stored as ASCII characters (score + 33):
library(ShortRead)
download.file(
"https://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/testdata/samplefastq/data.zip",
"data.zip"
)
unzip("data.zip")
fastq <- list.files("data", "*.fastq$")
fastq <- paste("data/", fastq, sep="")
names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_")fq <- readFastq(fastq[1]) # import first FASTQ file as ShortReadQ object
countLines(dirPath="./data", pattern=".fastq$") / 4 # count reads in each file
id(fq)[1] # read ID
sread(fq)[1] # sequence
quality(fq)[1] # Phred quality scores (ASCII)
as(quality(fq), "matrix")[1:4, 1:12] # convert Phred to numeric matrix
ShortReadQ(sread=sread(fq), quality=quality(fq), id=id(fq)) # construct from componentsQualityScaledDNAStringSet from scratchdset <- DNAStringSet(sapply(1:100, function(x)
paste(sample(c("A","T","G","C"), 20, replace=TRUE), collapse="")))
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=TRUE))
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x)))
myqual <- PhredQuality(myqual)
dsetq1 <- QualityScaledDNAStringSet(dset, myqual)
dsetq1[1:2]systemPipeR — comprehensive QC plotsGenerates a multi-panel QC report covering: - Read length distribution - Per-position base quality - Per-position nucleotide frequency - K-mer frequency - Adapter contamination
Output can be written to a single PDF covering all samples.
ShortRead — alternative QCsp <- SolexaPath(system.file('extdata', package='ShortRead'))
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
fls <- c(fl, fl)
coll <- QACollate(
QAFastqSource(fls),
QAReadQuality(),
QAAdapterContamination(),
QANucleotideUse(),
QAQualityUse(),
QASequenceUse(),
QAFrequentSequence(n=10),
QANucleotideByCycle(),
QAQualityByCycle()
)
x <- qa2(coll, verbose=TRUE)
res <- report(x)
if (interactive()) browseURL(res) # open HTML report in browserAll filtering operates on ShortReadQ objects and returns a filtered subset.
For large FASTQ files, streaming avoids loading everything into memory at once.
f <- FastqStreamer(fastq[1], 50) # open streamer with chunk size 50
while (length(fq <- yield(f))) {
fqsub <- fq[grepl("^TT", sread(fq))] # keep reads starting with "TT"
writeFastq(fqsub,
paste(fastq[1], "sub", sep="_"),
mode="a", # append mode — builds file across chunks
compress=FALSE)
}
close(f) # always close the streamerTip
Streaming workflow is the standard approach for production FASTQ processing: open a FastqStreamer, yield chunks, apply filters/trimming, write output, repeat until the file is exhausted. This handles files of any size with constant memory usage.
Genomic ranges (chromosome, start, end, strand) are the foundation of most genome-scale analyses in Bioconductor.
| Class | Package | Stores |
|---|---|---|
IRanges |
IRanges | integer ranges only |
GRanges |
GenomicRanges | ranges + chromosome + strand + metadata |
GRangesList |
GenomicRanges | list of GRanges objects |
GRanges objectGRanges objectTip
findOverlaps and subsetByOverlaps are among the most frequently used functions in genomic analyses — they power intersection of features with peaks, reads, SNPs, and any other range-based data.
TxDb (TranscriptDb) objects store transcript, exon, and CDS ranges in a SQLite database — making annotation retrieval robust and convenient.
TxDb from a GFF filelibrary(txdbmaker)
download.file(
"https://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff",
"data/gff3.gff"
)
txdb <- makeTxDbFromGFF(
file = "data/gff3.gff",
format = "gff",
dataSource = "TAIR",
organism = "Arabidopsis thaliana"
)
saveDb(txdb, file="./data/TAIR10.sqlite") # save for reuse
txdb <- loadDb("./data/TAIR10.sqlite") # reload later
transcripts(txdb) # all transcripts as GRanges
transcriptsBy(txdb, by="gene") # transcripts grouped by gene
exonsBy(txdb, by="gene") # exons grouped by geneTxDb from BioMartlibrary(GenomicFeatures); library(txdbmaker); library(biomaRt)
txdb <- makeTxDbFromBiomart(
biomart = "plants_mart",
dataset = "athaliana_eg_gene",
host = "https://plants.ensembl.org"
)
# Explore BioMart databases
listMarts(host="plants.ensembl.org")
mymart <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
listAttributes(mymart)
getBM(attributes=c("ensembl_gene_id", "description"), mart=mymart)[1:4,]getSeq — parse sequences for any GRanges annotationsParse all ranges defined in a GRanges object from a genome FASTA file:
library(Biostrings); library(Rsamtools)
# Create a random test genome and write to file
gff <- gff[values(gff)$type != "chromosome"]
rand <- DNAStringSet(sapply(
unique(as.character(seqnames(gff))),
function(x) paste(sample(c("A","T","G","C"), 200000, replace=TRUE), collapse="")
))
writeXStringSet(DNAStringSet(rand), "./data/test")
# Parse sequences for all annotated ranges
getSeq(FaFile("./data/test"), gff)extractTranscriptSeqs — splice-aware transcript parsingFor multi-exon transcripts, exon sequences must be extracted and concatenated. extractTranscriptSeqs handles this automatically:
Tip
Use getSeq for simple range extraction (genes, peaks, features).
Use extractTranscriptSeqs when you need spliced transcript sequences that span multiple exons.
Important
Assignment: HW05 — NGS Analysis Basics
The homework is based on the sequence analysis, FASTQ processing, and range operations covered in this tutorial.
| Function | Purpose |
|---|---|
readDNAStringSet() |
import FASTA file |
writeXStringSet() |
export to FASTA |
reverseComplement() |
reverse complement |
translate() |
DNA → protein |
matchPattern() / vmatchPattern() |
pattern matching (with mismatches) |
countPattern() / vcountPattern() |
count matches |
PWM() / matchPWM() |
position weight matrix search |
| Function | Purpose |
|---|---|
readFastq() |
import FASTQ file |
sread(), quality(), id() |
access sequence, quality, ID |
seeFastqPlot() |
QC report (systemPipeR) |
trimLRPatterns() |
adapter trimming |
trimTails() |
quality tail trimming |
nFilter(), polynFilter() |
N and complexity filters |
FastqStreamer() / yield() |
memory-efficient streaming |
| Function | Purpose |
|---|---|
import.gff() |
GFF/GTF → GRanges |
findOverlaps() |
find overlapping ranges |
subsetByOverlaps() |
keep overlapping ranges |
reduce() / coverage() |
merge / per-base coverage |
getSeq() |
extract sequences for ranges |
extractTranscriptSeqs() |
spliced transcript sequences |
Next: T7 — Workflow Framework