HW01 — Introduction to Linux and HPC
Topic: Linux Basics
- Log into your user account on the HPCC cluster, and from there into a compute node with srun. Note that the following code is also available in plain-text format here. The srun command may return a DISPLAY error if X11 is not enabled in the session. If this happens, drop the --x11 argument from the command. Users logging in from a macOS computer need to have XQuartz installed for X11 support (see here).
- Download code from this page
- Download Halobacterium proteome and inspect it
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/Halobacterium_salinarum/all_assembly_versions/GCA_004799605.1_ASM479960v1/GCA_004799605.1_ASM479960v1_protein.faa.gz
gunzip GCA_004799605.1_ASM479960v1_protein.faa.gz
mv GCA_004799605.1_ASM479960v1_protein.faa halobacterium.faa
less halobacterium.faa # press q to quit
- How many protein sequences are stored in the downloaded file?
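One way to answer this is to count header lines, since every FASTA record starts with a > character. The sketch below is an illustration and not necessarily the intended solution. It writes a few made-up sequences to a temporary file so that it runs anywhere; on the cluster, point the same grep call at halobacterium.faa instead.

```shell
# Build a tiny demo FASTA file (made-up sequences, for illustration only)
demo=$(mktemp)
cat > "$demo" <<'EOF'
>prot1 demo protein
MKWAHLLHGG
>prot2 demo protein
MKKLLAAGG
>prot3 demo protein
AAWCHGG
HHKL
EOF

# Every record has exactly one header line starting with '>',
# so counting those lines counts the sequences
n_seqs=$(grep -c '^>' "$demo")
echo "$n_seqs"
rm -f "$demo"
```

Anchoring the pattern with ^ avoids counting a > that appears inside a description line.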
- How many proteins contain the pattern WxHxxH or WxHxxHH?
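Because FASTA sequences wrap across several lines, a line-based grep counts matching lines rather than proteins and can miss a match that spans a line break. The following sketch shows one possible approach, not necessarily the intended solution: it treats each record as a unit and joins its sequence lines before matching. Any protein that matches WxHxxHH also matches WxHxxH, so the shorter expression W.H..H is sufficient for counting. The demo sequences are made up for illustration.

```shell
# Tiny demo FASTA file; prot3 only matches once its two lines are joined
demo=$(mktemp)
cat > "$demo" <<'EOF'
>prot1 demo protein
MKWAHLLHGG
>prot2 demo protein
MKKLLAAGG
>prot3 demo protein
AAWCHGG
HHKL
EOF

# RS='>' turns every FASTA entry into a single awk record
n_hits=$(awk -v RS='>' '
    {
        seq = substr($0, index($0, "\n") + 1)   # drop the header line
        gsub(/\n/, "", seq)                      # join wrapped sequence lines
        if (seq ~ /W.H..H/) hits++               # count each protein at most once
    }
    END { print hits + 0 }' "$demo")
echo "$n_hits"
rm -f "$demo"
```

Restricting the match to the sequence part also prevents accidental hits in the description line.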
- Use less to find IDs for pattern matches or use awk
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | less
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | grep '^>' | cut -c 2- | cut -f 1 -d ' ' > myIDs
- Create a BLASTable database with makeblastdb (the BLAST+ replacement for the legacy formatdb)
module load ncbi-blast/2.2.31+
makeblastdb -in halobacterium.faa -out halobacterium.faa -dbtype prot -hash_index -parse_seqids
- Query BLASTable database by IDs stored in a file (e.g. myIDs)
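One way to run this query is with blastdbcmd, which reads a list of IDs through its -entry_batch option. The sketch below assumes that the BLAST+ module is loaded and that the database was built with -parse_seqids, as in the makeblastdb call. The output name myseq.fasta is chosen to match the next step, and the guard only skips the call when blastdbcmd or myIDs is missing.

```shell
# Assumption: BLAST+ is loaded (module load ncbi-blast/2.2.31+) and
# halobacterium.faa was indexed with -parse_seqids so lookups by ID work
if command -v blastdbcmd >/dev/null 2>&1 && [ -s myIDs ]; then
    # -entry_batch reads one sequence ID per line from the given file
    blastdbcmd -db halobacterium.faa -dbtype prot -entry_batch myIDs -out myseq.fasta
    query_status="done"
else
    echo "blastdbcmd or myIDs not found; skipping retrieval" >&2
    query_status="skipped"
fi
echo "$query_status"
```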
- Run BLAST search for sequences stored in myseq.fasta
blastp -query myseq.fasta -db halobacterium.faa -outfmt 0 -evalue 1e-6 -out blastp.out
blastp -query myseq.fasta -db halobacterium.faa -outfmt 6 -evalue 1e-6 -out blastp.tab
- Return system time and host name
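Both values are available from standard utilities. A minimal sketch, with uname -n as a fallback in case hostname is not installed on the system:

```shell
now=$(date)                                # current system time
host=$(hostname 2>/dev/null || uname -n)   # name of the machine you are on
echo "$now"
echo "$host"
```

On a cluster, the host name is a quick way to check whether the shell is running on a login node or on a compute node.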
Additional exercise material is available in the HPCC Linux Manual.
Homework assignment
Perform the above analysis on the protein sequences from E. coli. Right-click the link to copy its URL so that it can be used with wget. Record the result of the final BLAST command (the one with -outfmt 6) in a text file named myresult.txt.
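Before submitting, it can help to confirm that the result file really is in tabular format. With the default -outfmt 6 settings, each hit is one line of 12 tab-separated fields (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below checks the field count on two made-up lines; the values are placeholders, not real BLAST results. On the cluster, run the same awk command on myresult.txt.

```shell
demo_tab=$(mktemp)
# Two made-up hit lines in the 12-column layout (placeholder values only)
printf 'q1\ts1\t100.00\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105\n'   > "$demo_tab"
printf 'q1\ts2\t45.10\t51\t26\t2\t1\t50\t3\t52\t2e-08\t40.0\n' >> "$demo_tab"

# Count lines whose number of tab-separated fields is not 12
bad_lines=$(awk -F'\t' 'NF != 12 { bad++ } END { print bad + 0 }' "$demo_tab")
echo "$bad_lines"
rm -f "$demo_tab"
```

A result of 0 means every line has the expected number of fields.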
Homework submission
Upload the result file (myresult.txt) to your private course GitHub repository under Homework/HW1/HW1.txt.
Due date
Most homework assignments are due one week after they are assigned. This one is due on Thu, April 10th at 6:00 PM.
Homework solution
Will be posted after the due date.