Project Data Management
Overview
The following explains how to download the full RNA-Seq data set from Howard et al. 2013 (SRP010938) that is used by Expression Profiling Group in GEN242. The download information of the data sets used by the other groups is described in the corresponding project outlines (Google Docs). Some of the tutorial pages provide additional useful information, including RNA-Seq, scRNA-Seq and ML/AI.
The first paragraphs below describe how to set up a project environments on the HPCC, which is useful for all projects, not just RNA-Seq. This is followed by the actual data downloads specific to the RNA-Seq project including FASTQ and reference genome files.
Big data space on HPCC
All larger data sets of the course projects will be organized in a big data space under /bigdata/gen242/<user_name>. Within this space, each student will work in a subdirectory named after their project:
/bigdata/gen242/<user_name>/<first_last_name>_project
Project GitHub repositories
Students will work on their course projects within GitHub repositories, one for each student. These project repositories are private and have been shared with each student. To populate a course project with an initial project workflow, please follow the instructions given in the following section.
Generate workflow environment with real project data
Log in to the HPCC cluster and set your working directory to
bigdataor (/bigdata/gen242/<user_name>)Clone the GitHub repository for your project with
git clone ...(URLs are listed inCourse Planningsheet) and thencdinto this directory. As mentioned above, the project GitHub repos follow this naming convention:<first_last_name>_project.Generate the workflow environment for your project on the HPCC cluster with
genWorkenvir_ghas outlined here.Next,
cdinto the directory of your workflow, delete its defaultdataandresultsdirectories, and then substitute them with empty directories outside of your project GitHub repos as follows (<workflow>needs to be replaced with actual workflow name):Within your workflow directory create symbolic links pointing to the new directories created in the previous step. For instance, the projects using the RNA-Seq workflow should create the symbolic links for their
dataandresultsdirectories like this (<user_name>and<workflow>needs to be replaced with your user name and workflow name):Add the workflow directory to the GitHub repository of your project with
git add -Aand then runcommitandpushas outlined in the GitHub instructions of this course here. After this check whether the workflow directory and its content shows up on your project’s online repos on GitHub. Very important: make sure that thedataandresultsare empty at this point. If not investigate why and fix the problem in the corresponding step above.Download the FASTQ files of your project with
getSRAfastq(see below) to thedatadirectory of your project.Generate a proper
targetsfile for your project where the first column(s) point(s) to the downloaded FASTQ files. In addition, provide sample names matching the experimental design (columns:SampleNamesandFactor). More details about the structure of targets files are provided here. Ready to use targets files for the RNA-Seq project can be downloaded as tab separated (TSV) file from here. Alternatively, one can download the corresponding Google Sheets with theread_sheetfunction from thegooglesheets4package (RNA-Seq GSheet). Importantly, thetargetsPE.txtfile provided by thesprwf-rnaseq-gen242-2026repos can be used without any changes for the Group 1 project.Inspect and adjust the
.paramfiles you will be using. For instance, make sure the software modules you are loading and the path to the reference genome are correct.Every time you start working on your project you
cdinto the directory of the repository and then rungit pullto get the latest changes. When you are done, you commit and push your changes back to GitHub withgit commit -am "some edits"; git push.
Download of project data
After logging in to one of the computer nodes via srun, open R from within the GitHub repository of your project and then run the following code section, but only those that apply to your project.
FASTQ files from SRA
Create FASTQ ID list
- The FASTQ files for the RNA-Seq project are from SRA study SRP010938 (Howard et al. 2013)
Load libraries and modules
library(systemPipeR)
moduleload("sratoolkit/3.0.0")
system("vdb-config --prefetch-to-cwd") # sets download default to current directory
# system('prefetch --help') # helps to speed up fastq-dump
# system('vdb-config -i') # allows to change SRA Toolkit configuration; instructions are here: https://bit.ly/3lzfU4P
# system('fastq-dump --help') # below uses this one for backwards compatibility
# system('fasterq-dump --help') # faster than fastq-dumpDefine download function
The following function downloads and extracts the FASTQ files for each project from SRA. Internally, it uses the prefetch and fastq-dump utilities of the SRA Toolkit from NCBI. The faster fasterq-dump alternative (see comment line below) is not used here for historical reasons. Note, if you use the SRA Toolkit in your HPCC user account for the first time, then it might ask you to configure it by running vdb-config --interactive from the command-line. In the resulting dialog, one can keep the default settings, and then save and exit. By running prior to any FASTQ file downloads vdb-config --prefetch-to-cwd, the download location will be set to the current working directory (see above).
getSRAfastq <- function(sraid, threads=1) {
system(paste("prefetch", sraid)) # makes download faster
system(paste("vdb-validate", sraid)) # checks integrity of the downloaded SRA file
system(paste("fastq-dump --split-files --gzip", sraid)) # gzip option makes it slower but saves storage space
# system(paste("fasterq-dump --threads 4 --split-files --progress ", sraid, "--outdir .")) # Faster alternative to fastq-dump
unlink(x=sraid, recursive = TRUE, force = TRUE) # deletes sra download directory
}To stop the loop after a failure is detected by vdb-validate, use && operator like this: prefetch sraid && vdb-validate sraid && fastq-dump sraid.
Run download
Note the following performs the download in serialized mode for the chosen data set and saves the extracted FASTQ files to the current working directory.
mydir <- getwd(); setwd("data")
for(i in sraidv) getSRAfastq(sraid=i)
setwd(mydir)
## Check whether all FASTQ files were downloaded
downloaded_files <- list.files('./data', pattern='fastq.gz$')
all(sraidv %in% gsub("_.*", "", downloaded_files)) # Should be TRUEAlternatively, the download can be performed in parallelized mode with BiocParallel. Please run this version only on a compute node.
Avoid FASTQ download
To save time, skip the download of the FASTQ files. Instead generate in the data directory of your workflow symlinks to already downloaded FASTQ files.
fastq_symlink <- function(workflow) {
file_paths <- list.files(file.path("/bigdata/gen242/data", workflow, "data"), pattern='fastq.gz$', full.names=TRUE)
for(i in seq_along(file_paths)) system(paste0("ln -s ", file_paths[i], " ./data/", basename(file_paths[i])))
}
workflow_type <- 'fastq_rnaseq' # Choose here correct workflow
fastq_symlink(workflow=workflow_type)If needed, the precomputed count table can be accessed via this path: /bigdata/gen242/data/fastq_rnaseq/results/countDFeByg.xls.
Download reference genome and annotation
The following downloadRefs function downloads the Arabidopsis thaliana genome sequence and GFF file from the TAIR FTP site. It also assigns consistent chromosome identifiers to make them the same among both the genome sequence and the GFF file. This is important for many analysis routines such as the read counting in the RNA-Seq workflow.
downloadRefs <- function(rerun=FALSE) {
if(rerun==TRUE) {
library(Biostrings)
download.file("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz", "./data/tair10.fasta.gz")
R.utils::gunzip("./data/tair10.fasta.gz")
download.file("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gff3.gz", "./data/tair10.gff.gz")
R.utils::gunzip("./data/tair10.gff.gz")
txdb <- GenomicFeatures::makeTxDbFromGFF(file = "data/tair10.gff", format = "gff", dataSource = "TAIR", organism = "Arabidopsis thaliana")
AnnotationDbi::saveDb(txdb, file="./data/tair10.sqlite")
download.file("https://cluster.hpcc.ucr.edu/~tgirke/Teaching/GEN242/data/tair10_functional_descriptions", "./data/tair10_functional_descriptions")
}
}After importing/sourcing the above function, execute it as follows: