Data Management for Course Projects
Shared big data space on HPCC
All larger data sets of the course projects will be organized in a big data space under
/bigdata/gen242/<user_name>. Within this space, each student will read and write data to a
project subdirectory named projdata:
/bigdata/gen242/<user_name>/projdata
Within each projdata directory all input files of a workflow (e.g. FASTQ) will be saved to
a data directory and all output files will be written to a results directory. To set up the proper
directory structure, cd into /bigdata/gen242/<user_name>, create the directory named projdata
and then within this directory create the data and results subdirectories. The full path to these
directories should look like this:
/bigdata/gen242/<user_name>/projdata/data
/bigdata/gen242/<user_name>/projdata/results
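The directory setup described above can be sketched as a small shell function. This is only an illustration: the name setup_projdata is hypothetical and not part of the course material, and the base directory is passed as an argument so the same code works for any account.

```shell
# Create the projdata/data and projdata/results layout under a base directory.
# setup_projdata is a hypothetical helper name used for illustration only.
setup_projdata() {
    local base="$1"    # e.g. /bigdata/gen242/<user_name>
    # -p creates missing parent directories and does not fail if they already exist
    mkdir -p "$base/projdata/data" "$base/projdata/results"
    # List the two subdirectories to confirm the layout
    ls -d "$base"/projdata/*/
}

# Usage on the cluster (replace <user_name> with your own account name):
# setup_projdata "/bigdata/gen242/<user_name>"
```

Because mkdir -p is idempotent, running the function a second time leaves existing files untouched.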
GitHub repositories for projects
Students will work on their course projects within GitHub repositories, one for each student. These project repositories are private and have been shared with each student via GitHub Classroom. To populate a course project with an initial project workflow, please follow the instructions given.
Generate workflow environment with project data
- Log in to the HPCC cluster and set your working directory to bigdata (/bigdata/gen242/<user_name>).
- Clone the GitHub repository for your project with git clone ... (URLs listed here) and then cd into this directory.
- Generate the workflow environment for your project on the HPCC cluster with genWorkenvir from systemPipeRdata.
- Delete the default data and results directories and replace them with symbolic links pointing to the data and results directories of your course project described above. For instance, the RNA-Seq project should create the symbolic links for its data and results directories like this:
  ln -s /bigdata/gen242/<user_name>/projdata/data data
  ln -s /bigdata/gen242/<user_name>/projdata/results results
- Add the workflow directory to the GitHub repository of your project with git add -A and then run commit and push as outlined in the GitHub instructions of this course here.
- Download the FASTQ files of your project with getSRAfastq (see below) to the data directory of your project, here /bigdata/gen242/<user_name>/projdata/data.
- Generate a proper targets file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: SampleName and Factor). More details about the structure of targets files are provided here. Ready-to-use targets files for both the RNA-Seq and ChIP-Seq projects can be downloaded as tab-separated (TSV) files from here. Alternatively, one can download the corresponding Google Sheets with the read_sheet function from the googlesheets4 package (RNA-Seq GSheet and ChIP-Seq GSheet).
- Inspect and adjust the .param files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.
- Every time you start working on your project, cd into the directory of the repository and then run git pull to get the latest changes. When you are done, commit and push your changes back to GitHub with git commit -am "some edits"; git push -u origin main.
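The symbolic link step in the list above could be wrapped in a helper that first checks that the link targets exist. This is a minimal sketch under stated assumptions: link_projdata is a hypothetical name, and the function must be run from inside the workflow directory because it deletes the data and results entries found in the current directory.

```shell
# Replace the default data and results directories with symbolic links.
# link_projdata is a hypothetical helper name; run it inside the workflow directory.
link_projdata() {
    local projdata="$1"   # e.g. /bigdata/gen242/<user_name>/projdata
    local d
    for d in data results; do
        # Stop if the link target has not been created yet
        [ -d "$projdata/$d" ] || { echo "missing: $projdata/$d" >&2; return 1; }
        # Remove the default directory (or a stale link) in the current directory
        rm -rf -- "$d"
        ln -s "$projdata/$d" "$d"
    done
}

# Usage on the cluster (replace <user_name> with your own account name):
# link_projdata "/bigdata/gen242/<user_name>/projdata"
```

Checking the target before deleting anything means a mistyped path does not leave the workflow directory without its data and results entries.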
Download of project data
Open R from within the GitHub repository of your project and then run the following code sections, but only those that apply to your project.
FASTQ files from SRA
Choose FASTQ data for your project
- The FASTQ files for the ChIP-Seq project are from SRA study SRP002174 (Kaufman et al. 2010)
sraidv <- paste("SRR0388", 45:51, sep="")
- The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)
sraidv <- paste("SRR4460", 27:44, sep="")
Load libraries and modules
library(systemPipeR)
moduleload("sratoolkit/2.9.2")
system('fastq-dump --help') # prints help to screen
Redirect cache output of SRA Toolkit
Newer versions of the SRA Toolkit create a cache directory (named ncbi) at the top level of a user's home directory.
To save space in home accounts (limited to 20GB), users need to redirect this output to their project's
data directory via a symbolic link. The following shows how to do this for the data directory
of the ChIP-Seq project.
system("ln -s /bigdata/gen242/<user_name>/projdata/data ~/ncbi")
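If ~/ncbi already exists as a regular directory, ln -s does not replace it but creates the link inside that directory, so the cache would still fill the home account. The following sketch guards against this case. The name redirect_cache is hypothetical, and the optional second argument exists only to make the function easy to try out in a scratch location.

```shell
# Create the cache link only when nothing is in the way.
# redirect_cache is a hypothetical helper name used for illustration only.
redirect_cache() {
    local target="$1"              # e.g. /bigdata/gen242/<user_name>/projdata/data
    local link="${2:-$HOME/ncbi}"  # where the SRA Toolkit cache directory is expected
    if [ -L "$link" ]; then
        echo "already a symbolic link: $link -> $(readlink "$link")"
    elif [ -e "$link" ]; then
        echo "exists and is not a link, move it away first: $link" >&2
        return 1
    else
        ln -s "$target" "$link"
    fi
}

# Usage on the cluster (replace <user_name> with your own account name):
# redirect_cache "/bigdata/gen242/<user_name>/projdata/data"
```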
Define download function
The following function downloads and extracts the FASTQ files for each project from SRA.
Internally, it uses the fastq-dump utility of the SRA Toolkit from NCBI.
getSRAfastq <- function(sraid, targetdir, maxreads="1000000000") {
    # Download one SRA run and write gzipped FASTQ files to targetdir
    system(paste("fastq-dump --split-files --gzip --maxSpotId",
                 maxreads, sraid, "--outdir", targetdir))
}
Run download
Note that the following performs the download serially for the chosen data set and saves the extracted FASTQ files to
the path specified under targetdir.
mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i, targetdir=".")
setwd(mydir)
Alternatively, the download can be performed in parallelized mode with BiocParallel. Please run this version only on one of the compute nodes.
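mydir <- getwd(); setwd("data")
# Uncomment the next two lines to run the parallel download on a compute node
# library(BiocParallel)
# bplapply(sraidv, getSRAfastq, targetdir=".", BPPARAM = MulticoreParam(workers=4))
setwd(mydir)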
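To check which commands the download will run before starting it, the fastq-dump calls can be printed from the shell first. The sketch below mirrors the flags used in getSRAfastq and only echoes the commands. The name sra_commands and its argument order are assumptions made for this illustration.

```shell
# Print one fastq-dump command per SRA run without executing anything.
# sra_commands is a hypothetical helper name used for illustration only.
sra_commands() {
    local prefix="$1" first="$2" last="$3" outdir="$4"
    local maxreads="${5:-1000000000}"   # same default as in getSRAfastq
    local i
    for i in $(seq "$first" "$last"); do
        echo "fastq-dump --split-files --gzip --maxSpotId $maxreads ${prefix}${i} --outdir $outdir"
    done
}

# ChIP-Seq runs SRR038845 to SRR038851, written to the data directory
sra_commands SRR0388 45 51 data
```

For the RNA-Seq project the equivalent call would be sra_commands SRR4460 27 44 data. Piping the printed lines to sh would execute them, which requires the sratoolkit module to be loaded in that shell.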
Download reference genome and annotation
The following downloadRefs function downloads the Arabidopsis thaliana genome sequence and GFF file from the TAIR FTP site.
It also assigns consistent chromosome identifiers so that they match between the genome sequence and the GFF file. This is
important for many analysis routines such as the read counting in the RNA-Seq workflow.
downloadRefs <- function(rerun=FALSE) {
    if(rerun==TRUE) {
        library(Biostrings)
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas", "./data/tair10.fasta")
        dna <- readDNAStringSet("./data/tair10.fasta")
        names(dna) <- paste(rep("Chr", 7), c(1:5, "M", "C"), sep="") # Fixes chromosome ids
        writeXStringSet(dna, "./data/tair10.fasta")
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff", "./data/tair10.gff")
        download.file("https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions", "./data/tair10_functional_descriptions")
    }
}
After importing/sourcing the above function, execute it as follows:
downloadRefs(rerun=TRUE)
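After the download, it can be useful to confirm that the chromosome identifiers in the FASTA file and the GFF file really agree. The following is a minimal sketch of such a check using standard command-line tools. The name check_seqids is hypothetical, and the check assumes that the GFF file is tab-delimited with the sequence identifier in the first column.

```shell
# Compare sequence identifiers between a FASTA file and a GFF file.
# check_seqids is a hypothetical helper name used for illustration only.
check_seqids() {
    local fasta="$1" gff="$2"
    local fa_ids gff_ids
    # FASTA identifiers: header lines without '>' and without any description text
    fa_ids="$(grep '^>' "$fasta" | sed 's/^>//; s/ .*//' | sort -u)"
    # GFF identifiers: first column of all non-comment lines
    gff_ids="$(grep -v '^#' "$gff" | cut -f1 | sort -u)"
    if [ "$fa_ids" = "$gff_ids" ]; then
        echo "identifiers match"
    else
        echo "identifiers differ" >&2
        echo "FASTA: $(echo $fa_ids)" >&2
        echo "GFF:   $(echo $gff_ids)" >&2
        return 1
    fi
}

# Usage from within the workflow directory:
# check_seqids data/tair10.fasta data/tair10.gff
```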