Shared big data space on biocluster
All larger data sets of the course projects will be organized in a shared big data space under /bigdata/gen242/shared. Within this space, each group will read and write data to a subdirectory named after their project:
/bigdata/gen242/shared/RNA-Seq1
/bigdata/gen242/shared/RNA-Seq2
/bigdata/gen242/shared/ChIP-Seq1
/bigdata/gen242/shared/ChIP-Seq2
/bigdata/gen242/shared/VAR-Seq1
Within each project subdirectory, all input files of a workflow (e.g. FASTQ) will be saved to a data directory and all output files will be written to a results directory.
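The layout described above can be sketched as follows. This is only an illustration in a scratch directory: BASE is a stand-in introduced here (an assumption), whereas on biocluster the base path is /bigdata/gen242/shared and the directories may already exist.

```shell
# BASE is a scratch stand-in (assumption); on biocluster it is /bigdata/gen242/shared
BASE=$(mktemp -d)

# One subdirectory per project, each with a data and a results directory
for project in RNA-Seq1 RNA-Seq2 ChIP-Seq1 ChIP-Seq2 VAR-Seq1; do
  mkdir -p "$BASE/$project/data" "$BASE/$project/results"
done

# Print the resulting directory tree
find "$BASE" -mindepth 1 -type d | sort
```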
GitHub repositories for projects
Students will work on the course projects within GitHub repositories, one for each course project. These project repositories are private and have been shared by the instructor with all members of each project group. To populate a course project with an initial project workflow, please follow the instructions given below.
Generate workflow environment with project data
1. Log in to biocluster and set your working directory to bigdata (or /bigdata/gen242/<user_name>).
2. Clone the GitHub repository for your project with git clone ... (see here) and then cd into this directory.
3. Generate the workflow environment for your project on biocluster with genWorkenvir from systemPipeRdata.
4. Delete the default data and results directories and replace them with symbolic links pointing to the data and results directories of your course project described above. For instance, the project RNA-Seq1 should create the symbolic links for its data and results directories like this:
   ln -s /bigdata/gen242/shared/RNA-Seq1/data data
   ln -s /bigdata/gen242/shared/RNA-Seq1/results results
5. Add the workflow directory to the GitHub repository of your project with git add -A and then run commit and push as outlined in the GitHub instructions of this course here. Note, steps 1-4 need to be performed only by one student in each project. After committing and pushing the repository to GitHub, it can be cloned by all other students (git clone ...).
6. Download the FASTQ files of your project with getSRAfastq (see below) to the data directory of your project.
7. Generate a proper targets file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: SampleNames and Factor).
8. Inspect and adjust the .param files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.
9. Every time you start working on your project, cd into the directory of the repository and then run git pull to get the latest changes. When you are done, commit and push your changes back to GitHub with git commit -am "some edits"; git push -u origin master.
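The replacement of the default data and results directories with symbolic links can be tried out safely in a scratch directory first. The sketch below is only an illustration: SANDBOX, SHARED and WORKDIR are stand-ins introduced here (assumptions). On biocluster, SHARED would be /bigdata/gen242/shared/RNA-Seq1 and WORKDIR the workflow directory created by genWorkenvir.

```shell
# Scratch stand-ins (assumptions) for the shared project space and the workflow directory
SANDBOX=$(mktemp -d)
SHARED="$SANDBOX/shared/RNA-Seq1"
WORKDIR="$SANDBOX/rnaseq"
mkdir -p "$SHARED/data" "$SHARED/results" "$WORKDIR/data" "$WORKDIR/results"

cd "$WORKDIR"
rm -r data results               # delete the default directories
ln -s "$SHARED/data" data        # link to the shared data directory
ln -s "$SHARED/results" results  # link to the shared results directory
ls -ld data results              # both entries should now show up as links
```

Deleting the defaults first matters: if data still exists as a directory, ln -s places the new link inside it instead of replacing it.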
Download of project data
Open R from within the GitHub repository of your project and then run the following code sections, but only those that apply to your project.
FASTQ files from SRA
Choose FASTQ data for your project
- The FASTQ files for the ChIP-Seq project are from SRA study SRP002174 (Kaufman et al. 2010)
sraidv <- paste("SRR0388", 45:51, sep="")
- The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)
sraidv <- paste("SRR4460", 27:44, sep="")
- The FASTQ files for the VAR-Seq project are from SRA study SRP008819 (Lu et al. 2012)
sraidv <- paste("SRR1051", 389:415, sep="")
Load libraries and modules
library(systemPipeR)
moduleload("sratoolkit/2.8.1")
system('fastq-dump --help') # prints help to screen
Redirect cache output of SRA Toolkit
Newer versions of the SRA Toolkit create a cache directory (named ncbi) in the highest level of a user’s home directory. To save space in home accounts (limited to 20GB), users need to redirect this output to their project’s data directory via a symbolic link. The following shows how to do this for the data directory of the ChIP-Seq1 project.
system("ln -s /bigdata/gen242/shared/ChIP-Seq1/data ~/ncbi")
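If ~/ncbi already exists as a real directory, ln -s does not replace it but creates the link inside it, so the cache would still fill the home account. A guarded variant of the link step could look like the sketch below. The function name link_sra_cache and the SANDBOX paths are hypothetical and only serve the demonstration. On biocluster the arguments would be the project's data directory and ~/ncbi.

```shell
# Hypothetical helper: create the cache link only if nothing is in the way
link_sra_cache() {
  target=$1
  link=$2
  if [ -L "$link" ]; then
    echo "already a symbolic link: $link -> $(readlink "$link")"
  elif [ -e "$link" ]; then
    echo "refusing to overwrite existing $link" >&2
    return 1
  else
    ln -s "$target" "$link"
  fi
}

# Scratch demonstration (assumption): stand-ins for the data directory and ~/ncbi
SANDBOX=$(mktemp -d)
mkdir -p "$SANDBOX/data"
link_sra_cache "$SANDBOX/data" "$SANDBOX/ncbi"
ls -ld "$SANDBOX/ncbi"
```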
Define download function
The following function downloads and extracts the FASTQ files for each project from SRA. Internally, it uses the fastq-dump utility of the SRA Toolkit from NCBI.
getSRAfastq <- function(sraid, targetdir, maxreads="1000000000") {
    # --split-files writes the reads of a spot to separate files; --gzip compresses the output
    system(paste("fastq-dump --split-files --gzip --maxSpotId",
                 maxreads, sraid, "--outdir", targetdir))
}
Run download
Note that the following performs the download in serialized mode for the chosen data set and saves the extracted FASTQ files to the path specified under targetdir.
mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i, targetdir=".")
setwd(mydir)
Alternatively, the download can be performed in parallelized mode with BiocParallel. Please run this version only on one of the compute nodes.
mydir <- getwd(); setwd("data")
# bplapply(sraidv, getSRAfastq, targetdir=".", BPPARAM = MulticoreParam(workers=4))
setwd(mydir)
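After either download variant has finished, it can be useful to check from the command line that every run produced a file. The following is a minimal sketch, assuming the output naming of fastq-dump with --split-files and --gzip (<id>_1.fastq.gz). The helper missing_fastq and the DEMO directory are hypothetical and used for illustration only.

```shell
# Hypothetical helper: print every run ID without an <id>_1.fastq.gz file in the directory
missing_fastq() {
  dir=$1
  shift
  for id in "$@"; do
    [ -e "$dir/${id}_1.fastq.gz" ] || echo "$id"
  done
}

# Build the ChIP-Seq run IDs SRR038845 to SRR038851 (same IDs as in sraidv)
ids=""
for n in $(seq 45 51); do
  ids="$ids SRR0388$n"
done

# Scratch demonstration (assumption): two pretend downloads in a temporary directory
DEMO=$(mktemp -d)
touch "$DEMO/SRR038845_1.fastq.gz" "$DEMO/SRR038846_1.fastq.gz"
missing_fastq "$DEMO" $ids   # lists the five runs that are still missing
```

Within a project, the call would be missing_fastq data $ids, with the IDs adjusted to the chosen data set.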
Download reference genome and annotation
The following downloadRefs function downloads the Arabidopsis thaliana genome sequence and GFF file from the TAIR FTP site. It also assigns consistent chromosome identifiers so that they are the same in both the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow.
downloadRefs <- function(rerun=FALSE) {
    if(rerun==TRUE) {
        library(Biostrings)
        download.file("ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas", "./data/tair10.fasta")
        dna <- readDNAStringSet("./data/tair10.fasta")
        names(dna) <- paste(rep("Chr", 7), c(1:5, "M", "C"), sep="") # Fixes chromosome ids
        writeXStringSet(dna, "./data/tair10.fasta")
        download.file("ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff", "./data/tair10.gff")
        download.file("ftp://ftp.arabidopsis.org/home/tair/Proteins/TAIR10_functional_descriptions", "./data/tair10_functional_descriptions")
    }
}
After importing/sourcing the above function, execute it as follows:
downloadRefs(rerun=TRUE)
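Whether the chromosome identifiers really agree between the two files can be checked from the command line. The sketch below is one possible way to do this, not part of the workflow itself: the helper names fasta_ids, gff_ids and compare_ids are hypothetical, and the toy files only stand in for data/tair10.fasta and data/tair10.gff.

```shell
# Hypothetical helpers: sequence IDs from FASTA headers and from column 1 of a GFF file
fasta_ids() { grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u; }
gff_ids() { grep -v '^#' "$1" | cut -f1 | sort -u; }

# Print identifiers that occur in only one of the two files (no output means they agree)
compare_ids() {
  tmp=$(mktemp -d)
  fasta_ids "$1" > "$tmp/fasta.ids"
  gff_ids "$2" > "$tmp/gff.ids"
  comm -3 "$tmp/fasta.ids" "$tmp/gff.ids"
  rm -r "$tmp"
}

# Toy demonstration (assumption): ChrM occurs only in the FASTA, ChrC only in the GFF
DEMO=$(mktemp -d)
printf '>Chr1 toy\nACGT\n>ChrM\nACGT\n' > "$DEMO/toy.fasta"
printf '##gff-version 3\nChr1\tTAIR10\tgene\t1\t4\t.\t+\t.\tID=g1\nChrC\tTAIR10\tgene\t1\t4\t.\t+\t.\tID=g2\n' > "$DEMO/toy.gff"
compare_ids "$DEMO/toy.fasta" "$DEMO/toy.gff"
```

For the downloaded references, the call would be compare_ids data/tair10.fasta data/tair10.gff, run from the project directory.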