HW6 - NGS Analysis Basics

2 minute read

Source code downloads: [ .R ]

A. Demultiplexing

Write a demultiplexing function that accepts any number of barcodes and splits a FASTQ file into as many subfiles as there are barcodes. At the same time the function should remove low quality tails from the reads. In both cases arguments should be provided so that the barcodes and cutoff can be specified by the user. The following function accomplishes the first step. Extend this function so that it performs the second step as well. As test data set one can use the FASTQ test files downloaded in the corresponding tutorial section here.

demultiplex <- function(x, barcode, nreads) {
	f <- FastqStreamer(x, nreads) 
	while(length(fq <- yield(f))) {
		for(i in barcode) {
			pattern <- paste("^", i, sep="")
			fqsub <- fq[grepl(pattern, sread(fq))] 
			if(length(fqsub) > 0) {
				writeFastq(fqsub, paste(x, i, sep="_"), mode="a", compress=FALSE)
			}
		}
	}
	close(f)
}
demultiplex(x=fastq[1], barcode=c("TT", "AA", "GG"), nreads=50)

B. Sequence Parsing

Download GFF from Halobacterium sp here
Download genome sequence from Halobacterium sp here
Task 1 Extract gene ranges, parse their sequences from genome and translate them into proteins
Task 2 Reduce overlapping genes and parse their sequences from genome
Task 3 Generate intergenic ranges and parse their sequences from genome

Useful commands

download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.gff", "data/AE004437.gff")
download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.fna", "data/AE004437.fna")
chr <- readDNAStringSet("data/AE004437.fna")
gff <- import("data/AE004437.gff")
gffgene <- gff[values(gff)[,"type"]=="gene"]
gene <- DNAStringSet(Views(chr[[1]], IRanges(start(gffgene), end(gffgene))))
names(gene) <- values(gffgene)[,"locus_tag"]
pos <- values(gffgene[strand(gffgene) == "+"])[,"locus_tag"]
p1 <- translate(gene[names(gene) %in% pos])
names(p1) <- names(gene[names(gene) %in% pos])
neg <- values(gffgene[strand(gffgene) == "-"])[,"locus_tag"]
p2 <- translate(reverseComplement(gene[names(gene) %in% neg]))
names(p2) <- names(gene[names(gene) %in% neg])
writeXStringSet(c(p1, p2), "./data/mypep.fasta")

Homework submission

Please submit the homework results in one well structured and annotated R script to your private GitHub repository under Homework/HW6/HW6.R. The script should include instructions on how to use the functions.

Due date

This homework will be due Tue, May 06 at 6:00 PM.

Homework Solutions

See here.

Last modified 2025-05-26: some edits (1747f3b)