Shared big data space on biocluster

All larger data sets of the course projects will be organized in a shared big data space under /bigdata/gen242/shared. Within this space, each group will read and write data in a subdirectory named after its project:

  • /bigdata/gen242/shared/RNA-Seq1
  • /bigdata/gen242/shared/RNA-Seq2
  • /bigdata/gen242/shared/ChIP-Seq1
  • /bigdata/gen242/shared/ChIP-Seq2
  • /bigdata/gen242/shared/VAR-Seq1

Within each project subdirectory all input files of a workflow (e.g. FASTQ) will be saved to a data directory and all output files will be written to a results directory.
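
The layout described above can be expressed as a small path helper. This is only a sketch: projectDirs is a hypothetical function name, not part of systemPipeR, and its default root is simply the shared path given above.

```r
# Hypothetical helper (not part of systemPipeR): build the data and results
# paths for a course project under the shared big data space.
projectDirs <- function(project, root="/bigdata/gen242/shared") {
    base <- file.path(root, project)
    list(data=file.path(base, "data"), results=file.path(base, "results"))
}

dirs <- projectDirs("RNA-Seq1")
dirs$data    # "/bigdata/gen242/shared/RNA-Seq1/data"
dirs$results # "/bigdata/gen242/shared/RNA-Seq1/results"
```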

GitHub repositories for projects

Students will work on the course projects within GitHub repositories, one for each course project. These project repositories are private and have been shared by the instructor with all members of each project group. To populate a course project with an initial project workflow, please follow the instructions given below.

Generate workflow environment with project data

  1. Log in to biocluster and set your working directory to bigdata.
  2. Clone the GitHub repository for your project with git clone ... (see here) and then cd into this directory.
  3. Generate the workflow environment for your project on biocluster with genWorkenvir from systemPipeRdata.
  4. Replace the data and results directories with symbolic links pointing to the data and results directories of your course project described above. For instance, for the project RNA-Seq1 the links on biocluster should point to:
    • /bigdata/gen242/shared/RNA-Seq1/data
    • /bigdata/gen242/shared/RNA-Seq1/results
  5. Add the workflow directory to the GitHub repository of your project with git add -A. Note that steps 1-4 need to be performed by only one student in each project. After this student has committed and pushed the repository to GitHub, all other students can clone it with git clone ....
  6. Download the FASTQ files of your project with getSRAfastq (see below) to the data directory of your project.
  7. Generate a proper targets file for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns: SampleName and Factor).
  8. Inspect and adjust the .param files you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.
  9. Every time you start working on your project, cd into the directory of the repository and run git pull to get the latest changes. When you are done, commit and push your changes back to GitHub with git commit -am "some edits"; git push -u origin master.
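
Step 4 of the list above can be sketched in R as follows. linkSharedDirs is a hypothetical helper, not a systemPipeR function. It assumes that the workflow directory already exists (for example one created by genWorkenvir) and that the shared project directory contains data and results. Note that it deletes the local data and results directories, including any sample files in them, before creating the links. The demonstration runs in a temporary directory rather than on the real shared space.

```r
# Hypothetical helper: replace the local data/ and results/ directories of a
# workflow directory with symbolic links to the shared project directories.
linkSharedDirs <- function(workdir, shared) {
    for(d in c("data", "results")) {
        target <- file.path(shared, d)
        link <- file.path(workdir, d)
        if(!dir.exists(target)) stop("Missing shared directory: ", target)
        unlink(link, recursive=TRUE) # removes the local directory and its content
        ok <- file.symlink(from=target, to=link)
        if(!ok) stop("Could not create link: ", link)
    }
    invisible(Sys.readlink(file.path(workdir, c("data", "results"))))
}

# Demonstration with temporary stand-ins for the shared and workflow directories
tmp <- tempfile("gen242_"); dir.create(tmp)
shared <- file.path(tmp, "shared", "RNA-Seq1")
workdir <- file.path(tmp, "rnaseq")
for(p in file.path(rep(c(shared, workdir), each=2), c("data", "results")))
    dir.create(p, recursive=TRUE)
links <- linkSharedDirs(workdir, shared)
links # paths the two links point to
```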

Download of project data

FASTQ files from SRA

Choose FASTQ data for your project

  • The FASTQ files for the ChIP-Seq project are from SRA study SRP002174 (Kaufman et al. 2010)
    sraidv <- paste("SRR0388", 45:51, sep="") 
    
  • The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)
    sraidv <- paste("SRR4460", 27:44, sep="")
    
  • The FASTQ files for the VAR-Seq project are from SRA study SRP008819 (Lu et al. 2012)
    sraidv <- paste("SRR1051", 389:415, sep="")
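
As a quick check, each of the paste calls above expands a shared accession prefix into one run ID per sample. The sketch below uses the ChIP-Seq IDs and paste0, which is equivalent to paste(..., sep="").

```r
# Expand the ChIP-Seq run accessions from the common prefix and a numeric range
sraidv <- paste0("SRR0388", 45:51)
sraidv[1]      # "SRR038845"
length(sraidv) # 7
```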
    

Load libraries and modules

library(systemPipeR)
moduleload("sratoolkit/2.8.1")
system('fastq-dump --help') # prints help to screen

Redirect cache output of SRA Toolkit

Newer versions of the SRA Toolkit create a cache directory (named ncbi) at the top level of a user’s home directory. To save space in your home account, you may want to redirect this output to your project’s data directory via a symbolic link. The following shows how to do this for the data directory of the ChIP-Seq1 project.

system("ln -s /bigdata/gen242/shared/ChIP-Seq1/data ~/ncbi")
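
One caveat with ln -s: if ~/ncbi already exists as a directory, the link is created inside it rather than replacing it. A guarded variant might look like the following sketch, where linkCache is a hypothetical helper name and the paths in the demonstration are temporary stand-ins for the real ones.

```r
# Hypothetical helper: create the cache link only if nothing exists at 'link'
linkCache <- function(target, link) {
    if(file.exists(link)) {
        message("Not linking, path already exists: ", link)
        return(invisible(FALSE))
    }
    invisible(file.symlink(from=target, to=link))
}

# Demonstration with temporary paths instead of ~/ncbi
tmp <- tempfile("cache_"); dir.create(tmp)
datadir <- file.path(tmp, "data"); dir.create(datadir)
first <- linkCache(datadir, file.path(tmp, "ncbi"))  # creates the link
second <- linkCache(datadir, file.path(tmp, "ncbi")) # skipped, link exists
```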

Define download function

The following function downloads and extracts the FASTQ files for each project from SRA. Internally, it uses the fastq-dump utility of the SRA Toolkit from NCBI.

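# Download one SRA run and write gzipped FASTQ file(s) to targetdir
getSRAfastq <- function(sraid, targetdir, maxreads="1000000000") {
    # --split-files writes each read of a spot to a separate file;
    # --maxSpotId sets the highest spot id to dump
    system(paste("fastq-dump --split-files --gzip --maxSpotId",
                  maxreads, sraid, "--outdir", targetdir))
}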
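
To see what getSRAfastq passes to the shell without contacting SRA, the same paste call can be run on its own. fastqDumpCmd is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: return the fastq-dump command string instead of running it
fastqDumpCmd <- function(sraid, targetdir, maxreads="1000000000") {
    paste("fastq-dump --split-files --gzip --maxSpotId",
          maxreads, sraid, "--outdir", targetdir)
}

cmd <- fastqDumpCmd("SRR038845", targetdir=".")
cmd
# "fastq-dump --split-files --gzip --maxSpotId 1000000000 SRR038845 --outdir ."
```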

Run download

Note that the following performs the download serially for the chosen data set and saves the extracted FASTQ files to the path specified under targetdir.

mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i, targetdir=".")
setwd(mydir)

Alternatively, the download can be performed in parallel with BiocParallel. Please run this version only on one of the compute nodes.

mydir <- getwd(); setwd("data")
# library(BiocParallel) # provides bplapply and MulticoreParam
# bplapply(sraidv, getSRAfastq, targetdir=".", BPPARAM = MulticoreParam(workers=4))
setwd(mydir)
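
With the FASTQ files in place, the targets table from step 7 of the workflow setup could be drafted as in the sketch below. The file names follow the _1.fastq.gz pattern that fastq-dump --split-files --gzip is expected to produce for the first read. The sample names and factor levels are placeholders that must be replaced with the real experimental design, and the output file name targets.txt is an assumption. The file is written to a temporary directory here; in practice it would go into your workflow directory.

```r
# Placeholder design: replace SampleName and Factor with the real values
sraidv <- paste0("SRR0388", 45:51)
targets <- data.frame(
    FileName=file.path("./data", paste0(sraidv, "_1.fastq.gz")),
    SampleName=paste0("S", seq_along(sraidv)), # placeholder sample names
    Factor=rep("unknown", length(sraidv)),     # placeholder factor levels
    stringsAsFactors=FALSE
)

# Report files that are listed but not present on disk
notfound <- targets$FileName[!file.exists(targets$FileName)]
if(length(notfound) > 0) message(length(notfound), " FASTQ file(s) not found yet")

# Write a tab-delimited targets file (file name is an assumption)
outfile <- file.path(tempdir(), "targets.txt")
write.table(targets, outfile, sep="\t", quote=FALSE, row.names=FALSE)
```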

Download reference genome and annotation

The following downloadRefs function downloads the Arabidopsis thaliana genome sequence and GFF file from the TAIR FTP site. It also assigns consistent chromosome identifiers so that they match between the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow.

downloadRefs <- function(rerun=FALSE) {
    if(rerun==TRUE) {
        library(Biostrings)
        system("wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas -P ./data/")
        dna <- readDNAStringSet("./data/TAIR10_chr_all.fas")
        names(dna) <- paste(rep("Chr", 7), c(1:5, "M", "C"), sep="") # Fixes chromosome ids
        writeXStringSet(dna, "./data/TAIR10_chr_all.fas")
        system("wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff -P ./data/")
        system("wget ftp://ftp.arabidopsis.org/home/tair/Proteins/TAIR10_functional_descriptions -P ./data/")
    }
}

After sourcing the above function, execute it as follows:

downloadRefs(rerun=FALSE) # To execute the function set 'rerun=TRUE'
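
The identifier fix inside downloadRefs can be tried in isolation. The sketch below builds the same seven names and compares them against a set of GFF sequence IDs. The gff_ids vector is a made-up stand-in for the first column of the GFF file, not values read from TAIR.

```r
# Same naming scheme as in downloadRefs: five nuclear chromosomes plus M and C
chr_ids <- paste0("Chr", c(1:5, "M", "C"))
chr_ids # "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

# Made-up stand-in for the sequence IDs in the first GFF column
gff_ids <- c("Chr1", "Chr2", "Chr3", "Chr4", "Chr5", "ChrC", "ChrM")

# IDs present in one source but not the other; both results should be empty
only_gff <- setdiff(gff_ids, chr_ids)
only_fasta <- setdiff(chr_ids, gff_ids)
```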