Data Management for Course Projects
Shared big data space on HPCC
All larger data sets of the course projects will be organized in a big data space under /bigdata/gen242/<user_name>. Within this space, each student will read and write data to a subdirectory named after their project:
/bigdata/gen242/<user_name>/projdata
Within each projdata directory, all input files of a workflow (e.g. FASTQ) will be saved to a data directory and all output files will be written to a results directory. To set up the proper directory structure, cd into /bigdata/gen242/<user_name>, create the directory named projdata, and then within this directory create the data and results subdirectories. The full paths to these directories should look like this:
/bigdata/gen242/<user_name>/projdata/data
/bigdata/gen242/<user_name>/projdata/results
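The directory setup described above can be sketched in the shell as follows. The BASE variable is an assumption introduced only for this illustration: on the cluster it would be set to /bigdata/gen242/<user_name>, and the fallback to a temporary directory merely allows the snippet to be tried safely elsewhere.

```shell
# BASE is a placeholder introduced for this sketch; on HPCC set it to
# /bigdata/gen242/<user_name> before running.
BASE="${BASE:-$(mktemp -d)}"

# -p creates projdata and both subdirectories in one step and does not
# complain if they already exist.
mkdir -p "$BASE/projdata/data" "$BASE/projdata/results"

# List the two resulting directories to confirm the layout.
ls -d "$BASE/projdata/data" "$BASE/projdata/results"
```

Because mkdir -p is idempotent, rerunning the commands later does not disturb files that were already written to either directory.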
GitHub repositories for projects
Students will work on their course projects within GitHub repositories, one for each student. These project repositories are private and have been shared with each student via GitHub Classroom. To populate a course project with an initial project workflow, please follow the instructions given.
Generate workflow environment with project data
- Log in to the HPCC cluster and set your working directory to your bigdata directory (/bigdata/gen242/<user_name>).
- Clone the GitHub repository for your project with git clone ... (URLs listed here) and then cd into this directory.
- Generate the workflow environment for your project on the HPCC cluster with genWorkenvir from systemPipeRdata.
- Delete the default data and results directories and replace them with symbolic links pointing to the data and results directories of your course project described above. For instance, for the RNA-Seq project the symbolic links for the data and results directories are created like this:
  ln -s /bigdata/gen242/<user_name>/projdata/data data
  ln -s /bigdata/gen242/<user_name>/projdata/results results
- Add the workflow directory to the GitHub repository of your project with git add -A and then run commit and push as outlined in the GitHub instructions of this course here.
- Download the FASTQ files of your project with getSRAfastq (see below) to the data directory of your project, here /bigdata/gen242/<user_name>/projdata/data.
- Generate a proper targets file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: SampleNames and Factor). More details about the structure of targets files are provided here. Ready-to-use targets files for both the RNA-Seq and ChIP-Seq projects can be downloaded as tab-separated (TSV) files from here. Alternatively, one can download the corresponding Google Sheets with the read_sheet function from the googlesheets4 package (RNA-Seq GSheet and ChIP-Seq GSheet).
- Inspect and adjust the .param files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.
- Every time you start working on your project, cd into the directory of the repository and then run git pull to get the latest changes. When you are done, commit and push your changes back to GitHub with git commit -am "some edits"; git push -u origin main.
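The step that replaces the default data and results directories with symbolic links can be sketched as below. PROJ and WF are placeholder names introduced for this illustration: PROJ stands in for /bigdata/gen242/<user_name>/projdata and WF for the workflow directory created by genWorkenvir. The mkdir line only simulates that layout in temporary directories so the sketch can run anywhere; it is not needed on the cluster.

```shell
# PROJ and WF are placeholders for this sketch (see the text above).
PROJ="${PROJ:-$(mktemp -d)/projdata}"
WF="${WF:-$(mktemp -d)/workflow}"

# Simulate the shared project space and the default workflow directories.
mkdir -p "$PROJ/data" "$PROJ/results" "$WF/data" "$WF/results"

cd "$WF" || exit 1

# Remove the default directories, then link to the shared project space.
rm -r data results
ln -s "$PROJ/data" data
ln -s "$PROJ/results" results

# Both entries should now be listed as symbolic links pointing into PROJ.
ls -l data results
```

Running ls -l afterwards is a quick way to confirm that data and results are links rather than regular directories before any large files are written.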
Download of project data
Open R from within the GitHub repository of your project and then run the following code sections, but only those that apply to your project.
FASTQ files from SRA
Choose FASTQ data for your project
- The FASTQ files for the ChIP-Seq project are from SRA study SRP002174 (Kaufman et al. 2010)
sraidv <- paste("SRR0388", 45:51, sep="")
- The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)
sraidv <- paste("SRR4460", 27:44, sep="")
Load libraries and modules
library(systemPipeR)
moduleload("sratoolkit/2.9.2")
system('fastq-dump --help') # prints help to screen
Redirect cache output of SRA Toolkit
Newer versions of the SRA Toolkit create a cache directory (named ncbi) in the highest level of a user’s home directory. To save space in home accounts (limited to 20GB), users need to redirect this output to their project’s data directory via a symbolic link. The following shows how to do this for the data directory of the ChIP-Seq project.
system("ln -s /bigdata/gen242/<user_name>/projdata/data ~/ncbi")
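One detail worth guarding against: if ncbi already exists as a regular directory in the home directory, ln -s places the new link inside it instead of replacing it. The following shell sketch checks for that case first. HOME_DIR, TARGET, and LINK are placeholder names introduced here; on the cluster HOME_DIR would be the home directory and TARGET would be /bigdata/gen242/<user_name>/projdata/data. The temporary-directory defaults exist only so the snippet can be tried without touching a real account.

```shell
# HOME_DIR and TARGET are placeholders for this sketch (see the text above).
HOME_DIR="${HOME_DIR:-$(mktemp -d)}"
TARGET="${TARGET:-$(mktemp -d)}"
LINK="$HOME_DIR/ncbi"

if [ -L "$LINK" ]; then
    # Already redirected; report where the link points.
    echo "ncbi is already a symbolic link to $(readlink "$LINK")"
elif [ -e "$LINK" ]; then
    # A real directory or file is left alone; move or delete it manually first.
    echo "ncbi exists and is not a link; nothing changed" >&2
else
    ln -s "$TARGET" "$LINK"
    echo "linked $LINK -> $TARGET"
fi
```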
Define download function
The following function downloads and extracts the FASTQ files for each project from SRA. Internally, it uses the fastq-dump utility of the SRA Toolkit from NCBI.
getSRAfastq <- function(sraid, targetdir, maxreads="1000000000") {
    system(paste("fastq-dump --split-files --gzip --maxSpotId",
                 maxreads, sraid, "--outdir", targetdir))
}
Run download
Note, the following performs the download in serialized mode for the chosen data set and saves the extracted FASTQ files to the path specified under targetdir.
mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i, targetdir=".")
setwd(mydir)
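Before starting a long download, it can help to see which commands the loop will issue. The shell sketch below is a dry run that only prints the fastq-dump calls for the ChIP-Seq accessions (SRR038845 to SRR038851) without executing them. The helper name dry_run and the MAXREADS variable are introduced here for illustration and are not part of the course code; the printed commands mirror what getSRAfastq assembles with paste.

```shell
# MAXREADS mirrors the default of the maxreads argument in getSRAfastq.
MAXREADS=1000000000

# dry_run is a hypothetical helper for this sketch: it prints one
# fastq-dump command per accession instead of running it.
dry_run() {
    for i in $(seq 45 51); do
        echo "fastq-dump --split-files --gzip --maxSpotId $MAXREADS SRR0388$i --outdir ."
    done
}

dry_run
```

For the RNA-Seq project the same idea applies with the prefix SRR4460 and the range 27 to 44.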
Alternatively, the download can be performed in parallelized mode with BiocParallel. Please run this version only on one of the compute nodes.
library(BiocParallel)
mydir <- getwd(); setwd("data")
# Uncomment the next line to run the parallel download on a compute node
# bplapply(sraidv, getSRAfastq, targetdir=".", BPPARAM = MulticoreParam(workers=4))
setwd(mydir)
Download reference genome and annotation
The following downloadRefs function downloads the Arabidopsis thaliana genome sequence and GFF file from the TAIR FTP site. It also assigns consistent chromosome identifiers so that they match between the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow.
downloadRefs <- function(rerun=FALSE) {
    if(rerun==TRUE) {
        library(Biostrings)
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas", "./data/tair10.fasta")
        dna <- readDNAStringSet("./data/tair10.fasta")
        names(dna) <- paste(rep("Chr", 7), c(1:5, "M", "C"), sep="") # Fixes chromosome ids
        writeXStringSet(dna, "./data/tair10.fasta")
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff", "./data/tair10.gff")
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions", "./data/tair10_functional_descriptions")
    }
}
After importing/sourcing the above function, execute it as follows:
downloadRefs(rerun=TRUE)
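Since matching chromosome identifiers matter for downstream steps, a quick shell check can confirm that every sequence identifier used in the GFF file also occurs in the FASTA file. The sketch below is an illustration under stated assumptions: the variable names are introduced here, and two tiny stand-in files are written to a temporary directory so that it runs without the real downloads. In the project, FASTA and GFF would point to data/tair10.fasta and data/tair10.gff.

```shell
# Stand-in files for this sketch; replace FASTA and GFF with the real paths
# (data/tair10.fasta and data/tair10.gff) when checking the project data.
TMP="$(mktemp -d)"
FASTA="$TMP/demo.fasta"
GFF="$TMP/demo.gff"
printf '>Chr1\nACGT\n>ChrM\nACGT\n' > "$FASTA"
printf 'Chr1\tTAIR10\tgene\t1\t4\t.\t+\t.\tID=g1\nChrM\tTAIR10\tgene\t1\t4\t.\t+\t.\tID=g2\n' > "$GFF"

# Sequence names: header lines without the leading ">" and without
# anything after the first space.
FASTA_IDS="$(grep '^>' "$FASTA" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u)"

# Feature sequence ids: first column of all non-comment GFF lines.
GFF_IDS="$(grep -v '^#' "$GFF" | cut -f 1 | sort -u)"

# Collect GFF identifiers that have no exact match among the FASTA names.
printf '%s\n' "$FASTA_IDS" > "$TMP/fasta_ids"
MISSING="$(printf '%s\n' "$GFF_IDS" | grep -vxFf "$TMP/fasta_ids" || true)"

if [ -z "$MISSING" ]; then
    echo "all GFF sequence ids are present in the FASTA file"
else
    echo "GFF sequence ids missing from the FASTA file:"
    echo "$MISSING"
fi
```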