HW05 — NGS Analysis Basics

Source code downloads: [ .R ]

Overview

This homework covers two topics from the NGS Analysis Basics tutorial: (A) extending a demultiplexing function to also trim low-quality read tails, and (B) parsing genomic sequences using GFF annotation and Bioconductor range operations. Both parts require the Biostrings, ShortRead, GenomicRanges, and rtracklayer Bioconductor packages.

Install required packages if needed:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings", "ShortRead", "GenomicRanges", "rtracklayer"))

A. Demultiplexing with Quality Trimming

Background

Demultiplexing splits a multiplexed FASTQ file into separate files based on barcode sequences at the start of each read. A common additional step is to remove low-quality bases from the 3’ end of each read (quality tail trimming) before writing the output.

The following starter function handles demultiplexing only. Your task is to extend it to also perform quality tail trimming.

Starter function (demultiplexing only)

library(ShortRead)

demultiplex <- function(x, barcode, nreads) {
    f <- FastqStreamer(x, nreads)
    while(length(fq <- yield(f))) {
        for(i in barcode) {
            pattern <- paste("^", i, sep = "")
            fqsub <- fq[grepl(pattern, sread(fq))]
            if(length(fqsub) > 0) {
                writeFastq(fqsub, paste(x, i, sep = "_"),
                           mode = "a", compress = FALSE)
            }
        }
    }
    close(f)
}

Task A — Extend `demultiplex` to trim low-quality tails

Write a new function named demultiplex_trim that extends the starter function above with a quality trimming step. The function must have the following exact signature:

demultiplex_trim <- function(x, barcode, nreads, cutoff = 20) { ... }

Arguments:

Argument	Type	Description
`x`	character	Path to input FASTQ file
`barcode`	character vector	One or more barcode sequences to match at read start
`nreads`	integer	Number of reads to process per chunk (passed to `FastqStreamer`)
`cutoff`	integer	Minimum quality score threshold for trimming (default: 20)

Required behavior:

Stream the FASTQ file in chunks of nreads reads
For each barcode, subset reads whose sequence starts with that barcode (same as starter)
Trim low-quality bases from the 3’ tail of each matched read before writing — remove trailing bases whose quality score falls below cutoff
Write trimmed, barcode-matched reads to a file named <x>_<barcode> using writeFastq(..., mode="a", compress=FALSE)

Useful functions for quality trimming:

## trimTails() from ShortRead trims bases from the 3' end
## where quality drops below a threshold
## Arguments: object, k (min consecutive good bases), a (quality character threshold)
trimTails(fqsub, k = 2, a = encoding(cutoff))

## Alternatively, use trimEnds() for a simpler threshold approach
## PhredQuality encoding: quality score 20 = ASCII '5'
## Use encoding() to convert numeric cutoff to the correct character
quality(fqsub)  # inspect quality scores
rawToChar(as.raw(cutoff+33)) or encoding(cutoff) # see score-to-character mapping

Test data: Download the FASTQ test files from the tutorial:

## Download FASTQ test files as used in the NGS tutorial, or obtain from ShortRead package:
library(ShortRead)
fastq <- list.files(system.file("extdata", "E-MTAB-1147", package = "ShortRead"), full.names=TRUE)


## Run your function on the first test file
demultiplex_trim(x = fastq[1], barcode = c("TT", "AA", "GG"),
                 nreads = 50, cutoff = 20)

Expected output: One FASTQ file per barcode (e.g. <fastq[1]>_TT, <fastq[1]>_AA, <fastq[1]>_GG) in the working directory, containing trimmed reads that matched the respective barcode.

B. Sequence Parsing from GFF and Genome

Background

Genomic annotation files in GFF format define the coordinates of genes and other features on a reference genome. In this task you will use Bioconductor range operations to extract gene sequences, reduce overlapping ranges, and generate intergenic sequences from the Halobacterium sp. genome.

Download data

Download the GFF annotation file and genome FASTA for Halobacterium sp.:

library(Biostrings)
library(GenomicRanges)
library(rtracklayer)

dir.create("data", showWarnings = FALSE)

download.file(
    "https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.gff",
    "data/AE004437.gff"
)
download.file(
    "https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.fna",
    "data/AE004437.fna"
)

## Import genome and GFF
chr    <- readDNAStringSet("data/AE004437.fna")
gff    <- import("data/AE004437.gff")

Task B1 — Extract gene sequences and translate to protein

Extract all gene ranges from the GFF, parse their sequences from the genome, and translate them into protein sequences. Account for strand: genes on the + strand are translated directly; genes on the - strand must be reverse-complemented before translation.

Write the combined protein sequences to a file named data/mypep.fasta.

Required variable names (used by the grading script):

Variable	Content
`gffgene`	GRanges object — gene features only (type == “gene”)
`gene`	`DNAStringSet` of gene sequences parsed from genome
`p1`	`AAStringSet` of translated proteins from `+` strand genes
`p2`	`AAStringSet` of translated proteins from `-` strand genes

Useful commands:

## Filter GFF to gene features only
gffgene <- gff[values(gff)[, "type"] == "gene"]

## Extract gene sequences using Views
gene <- DNAStringSet(Views(chr[[1]],
                           IRanges(start(gffgene), end(gffgene))))
names(gene) <- values(gffgene)[, "locus_tag"]

## Translate + strand genes
pos <- values(gffgene[strand(gffgene) == "+"])[, "locus_tag"]
p1  <- translate(gene[names(gene) %in% pos])
names(p1) <- names(gene[names(gene) %in% pos])

## Translate - strand genes (reverse complement first)
neg <- values(gffgene[strand(gffgene) == "-"])[, "locus_tag"]
p2  <- translate(reverseComplement(gene[names(gene) %in% neg]))
names(p2) <- names(gene[names(gene) %in% neg])

## Write all proteins to FASTA
writeXStringSet(c(p1, p2), "./data/mypep.fasta")

Task B2 — Reduce overlapping gene ranges

Collapse overlapping gene ranges using the reduce() function and parse the resulting non-redundant sequences from the genome.

Required variable names:

Variable	Content
`gffgene_reduced`	`GRanges` result of `reduce(gffgene)`
`gene_reduced`	`DNAStringSet` of sequences for reduced ranges

Useful commands:

## Reduce overlapping ranges
gffgene_reduced <- reduce(gffgene)

## Parse sequences for reduced ranges
gene_reduced <- DNAStringSet(Views(chr[[1]],
                    IRanges(start(gffgene_reduced),
                            end(gffgene_reduced))))

Task B3 — Generate intergenic ranges and parse sequences

Compute the intergenic regions (gaps between genes) and parse their sequences from the genome.

Required variable names:

Variable	Content
`intergenic`	`GRanges` or `IRanges` of intergenic regions
`gene_intergenic`	`DNAStringSet` of intergenic sequences

Useful commands:

## Compute gaps between reduced gene ranges
intergenic <- gaps(gffgene_reduced)

## Parse intergenic sequences
gene_intergenic <- DNAStringSet(Views(chr[[1]],
                       IRanges(start(intergenic),
                               end(intergenic))))

Homework Submission

Submit one R script named HW5.R to your private GitHub homework repository at the following exact path:

Homework/HW5/HW5.R

Requirements for full credit

Your script must:

Define the function demultiplex_trim(x, barcode, nreads, cutoff = 20) with the quality trimming step implemented
Include a usage example showing how to call demultiplex_trim on the ShortRead test FASTQ files
Implement all three Task B steps using the required variable names (gffgene, gene, p1, p2, gffgene_reduced, gene_reduced, intergenic, gene_intergenic)
Write the protein FASTA output to data/mypep.fasta
Include brief comments explaining each step

Recommended script structure

## ─────────────────────────────────────────────────────────────
## HW5: NGS Analysis Basics
## GEN242, Spring 2026
## Author: <your name>
## ─────────────────────────────────────────────────────────────

library(ShortRead)
library(Biostrings)
library(GenomicRanges)
library(rtracklayer)

## ── Part A: Demultiplexing with quality trimming ──────────────

demultiplex_trim <- function(x, barcode, nreads, cutoff = 20) {
    ## ... your implementation ...
}

## Usage example
# fastq <- fastq <- list.files(system.file("extdata", "E-MTAB-1147", package = "ShortRead"), full.names=TRUE)
# demultiplex_trim(x = fastq[1], barcode = c("TT", "AA", "GG"),
#                 nreads = 50, cutoff = 20)

## ── Part B: Sequence parsing from GFF and genome ─────────────

dir.create("data", showWarnings = FALSE)

## Download data
# download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.gff", "data/AE004437.gff")
# download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.fna", "data/AE004437.fna")

## Task B1: Extract gene sequences and translate to protein
## ... your implementation using gffgene, gene, p1, p2 ...

## Task B2: Reduce overlapping gene ranges
## ... your implementation using gffgene_reduced, gene_reduced ...

## Task B3: Generate intergenic ranges
## ... your implementation using intergenic, gene_intergenic ...

Due Date

This homework is due Thursday, May 7th at 6:00 PM.

Homework Solutions

See here.

Overview

A. Demultiplexing with Quality Trimming

Background

Starter function (demultiplexing only)

Task A — Extend demultiplex to trim low-quality tails

B. Sequence Parsing from GFF and Genome

Background

Download data

Task B1 — Extract gene sequences and translate to protein

Task B2 — Reduce overlapping gene ranges

Task B3 — Generate intergenic ranges and parse sequences

Homework Submission

Requirements for full credit

Recommended script structure

Due Date

Homework Solutions

Task A — Extend `demultiplex` to trim low-quality tails