HW2 - Introduction to Biocluster and Linux
Topic: Linux Basics
- Download code from this page

    wget https://cluster.hpcc.ucr.edu/~tgirke/Linux.sh --no-check-certificate

- Download Halobacterium proteome and inspect it

    wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/Halobacterium_salinarum/representative/GCA_004799605.1_ASM479960v1/GCA_004799605.1_ASM479960v1_protein.faa.gz
    gunzip GCA_004799605.1_ASM479960v1_protein.faa.gz
    mv GCA_004799605.1_ASM479960v1_protein.faa halobacterium.faa
    less halobacterium.faa # press q to quit

- How many protein sequences are stored in the downloaded file?

    grep '>' halobacterium.faa | wc # the first number is the line count
    grep '^>' halobacterium.faa --count

- How many proteins contain the pattern WxHxxH or WxHxxHH?

    grep -E 'W.H..H{1,2}' halobacterium.faa --count # counts matching lines; a motif split across two lines is not found

- Use less to find IDs for pattern matches or use awk

    awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | less
    awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | grep '^>' | cut -c 2- | cut -f 1 -d ' ' > myIDs

- Create a BLASTable database with makeblastdb (the BLAST+ successor to formatdb)

    module load ncbi-blast
    makeblastdb -in halobacterium.faa -out halobacterium.faa -dbtype prot -hash_index -parse_seqids

- Query BLASTable database by IDs stored in a file (e.g. myIDs)

    blastdbcmd -db halobacterium.faa -dbtype prot -entry_batch myIDs -get_dups -out myseq.fasta

- Run BLAST search for sequences stored in myseq.fasta

    blastp -query myseq.fasta -db halobacterium.faa -outfmt 0 -evalue 1e-6 -out blastp.out
    blastp -query myseq.fasta -db halobacterium.faa -outfmt 6 -evalue 1e-6 -out blastp.tab

- Return system time and host name

    date
    hostname
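
The counting and ID-extraction steps above can be tried without downloading anything. The following is a minimal sketch, assuming a made-up three-record file named toy.faa (the IDs and sequences are invented for illustration and are not taken from the Halobacterium proteome) and a hypothetical output file toy_IDs. It uses only grep and awk. The awk pattern is written as W.H..H, which matches the same records as W.H..H{1,2} but does not depend on interval-expression support in the local awk.

```shell
# Toy FASTA file; IDs and sequences are hypothetical, for illustration only.
cat > toy.faa <<'EOF'
>prot1 hypothetical protein A
MKWAHGGHLL
AVTR
>prot2 hypothetical protein B
MSSTTLLKK
>prot3 hypothetical protein C
MWCHTTHHKE
EOF

# Count records: every FASTA record starts with a '>' header line.
n_seqs=$(grep -c '^>' toy.faa)

# Count lines that contain WxHxxH or WxHxxHH (x = any residue).
n_hits=$(grep -E -c 'W.H..H{1,2}' toy.faa)

# Split the file into records on '>' and print the first word (the ID)
# of every record whose text contains the motif.
awk -v RS='>' '/W.H..H/ { print $1 }' toy.faa > toy_IDs

echo "sequences: $n_seqs, matching lines: $n_hits"
cat toy_IDs
```

Because grep works line by line, the count it reports is a count of matching lines, not of proteins. The awk version treats each record as one unit, which is why it is the one used to collect IDs.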
Additional exercise material in Linux Manual
Homework assignment
Perform the above analysis on the protein sequences from E. coli. A right click on the link will allow you to copy the URL so that it can be used together with wget.
Record the result from the final BLAST command (with -outfmt 6) in a text file.
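
When recording the tabular result, it may help to know that -outfmt 6 without extra specifiers writes twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The following is a minimal sketch, assuming a made-up two-line file toy.tab in that layout (the values are invented and are not real BLAST output) and a hypothetical output file toy_nonself.tab. It drops self-hits and prints query ID, subject ID, and E-value.

```shell
# Toy table in the default -outfmt 6 layout; all values are hypothetical.
printf 'p1\tp1\t100.000\t50\t0\t0\t1\t50\t1\t50\t1e-30\t105\n'  > toy.tab
printf 'p1\tp2\t45.000\t40\t22\t0\t5\t44\t3\t42\t2e-08\t48.1\n' >> toy.tab

# Keep hits where query (column 1) and subject (column 2) differ,
# then print query ID, subject ID, and E-value (column 11).
awk -F '\t' -v OFS='\t' '$1 != $2 { print $1, $2, $11 }' toy.tab > toy_nonself.tab
cat toy_nonself.tab
```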
Homework submission
Submit your homework to GEN242-2021 HW2 on GitHub Classroom by following these stepwise instructions:
- Upload your script and name it hw2.sh.
- Upload the unzipped faa file from step 1, name it ecoli.faa.
- Upload IDs from step 5 in a file named myIDs.
- Upload the final file generated with outfmt 6 from step 8, and name it ecoli.txt.
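
Before uploading, a quick check that all four files named above exist and are non-empty can catch naming mistakes. The following is a minimal sketch, assuming the files sit in the current directory. The function name check_submission is hypothetical and not part of the course material.

```shell
# check_submission: report each expected file as ok or missing/empty.
# Returns non-zero if any of the four files is missing or empty.
check_submission() {
    status=0
    for f in hw2.sh ecoli.faa myIDs ecoli.txt; do
        if [ -s "$f" ]; then
            echo "ok:      $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}

check_submission || echo "Some files are missing or empty."
```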
Due date
Most homework assignments are due one week after they are assigned. This one is due on Thu, April 8th at 6:00 PM.
Homework solution
See here.