HW2 - Introduction to Biocluster and Linux


Topic: Linux Basics

Download code from this page

wget https://cluster.hpcc.ucr.edu/~tgirke/Linux.sh --no-check-certificate

Download Halobacterium proteome and inspect it

wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/Halobacterium_salinarum/representative/GCA_004799605.1_ASM479960v1/GCA_004799605.1_ASM479960v1_protein.faa.gz
gunzip GCA_004799605.1_ASM479960v1_protein.faa.gz
mv GCA_004799605.1_ASM479960v1_protein.faa halobacterium.faa
less halobacterium.faa # press q to quit
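
The decompress and rename steps above can also be combined into one command. The following is a minimal sketch, not part of the original exercise. It uses a small made-up file (`toy_dl.faa`, a hypothetical name) so that it runs without a download. `gunzip -c` writes the decompressed data to standard output and leaves the compressed file in place.

```shell
# Build a tiny, made-up FASTA file and compress it to mimic the downloaded archive.
printf '>toy1 example protein\nMKTAYIAK\n' > toy_dl.faa
gzip -f toy_dl.faa                 # creates toy_dl.faa.gz and removes toy_dl.faa

# Decompress to a new name in one step; toy_dl.faa.gz is kept.
gunzip -c toy_dl.faa.gz > toy_renamed.faa

# Peek at the first lines without opening a pager.
head -n 2 toy_renamed.faa
```

With the real download, the same idea would read the `.faa.gz` file and redirect into `halobacterium.faa`.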

How many protein sequences are stored in the downloaded file?

grep '>' halobacterium.faa | wc
grep '^>' halobacterium.faa --count
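
Both commands count header lines, but they report the result differently. As a small illustration on made-up data (the file name `toy_count.faa` is hypothetical): `wc` without options prints line, word, and byte counts, whereas `grep --count` (short form `-c`) prints only the number of matching lines. Anchoring the pattern with `^` restricts matches to a `>` at the start of a line.

```shell
# Made-up three-record FASTA file for illustration only.
cat > toy_count.faa <<'EOF'
>p1 first toy protein
MKTAYIAK
>p2 second toy protein
MSLLGHW
>p3 third toy protein
MAAW
EOF

# wc -l restricts the wc report to the line count; tr strips padding some wc versions add.
n_wc=$(grep '>' toy_count.faa | wc -l | tr -d ' ')

# grep -c counts matching lines directly.
n_grep=$(grep -c '^>' toy_count.faa)

echo "wc -l: $n_wc, grep -c: $n_grep"
```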

How many proteins contain the pattern WxHxxH or WxHxxHH?
```
grep -E 'W.H..H{1,2}' halobacterium.faa --count
```
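
To see what the expression actually matches, it can help to print the matched substrings. The sketch below uses made-up sequence lines (the file name `toy_motif.txt` is hypothetical) and `grep -o`, an option supported by GNU and BSD grep that prints only the matching part of each line. Because `W.H..H` already matches wherever `W.H..HH` does, the `{1,2}` quantifier changes the length of the reported match, not the number of matching lines.

```shell
# Made-up sequence lines: the first two contain the motif, the third does not.
cat > toy_motif.txt <<'EOF'
AAWAHGGHHAA
CCWTHAAHCC
GGWAHGKKGG
EOF

# -E enables extended regular expressions; -o prints each match on its own line.
matches=$(grep -Eo 'W.H..H{1,2}' toy_motif.txt)
echo "$matches"

# Counting with and without the quantifier gives the same number of lines.
n_with=$(grep -Ec 'W.H..H{1,2}' toy_motif.txt)
n_without=$(grep -Ec 'W.H..H' toy_motif.txt)
echo "$n_with $n_without"
```

Note that `--count` reports matching lines, so on a real FASTA file the result is a count of sequence lines that contain the pattern.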

Use less to find IDs for pattern matches or use awk

awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | less
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | grep '^>' | cut -c 2- | cut -f 1 -d ' ' > myIDs
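
FASTA sequences are usually wrapped over several lines, so a motif can span a line break, and with `RS='>'` alone the expression is also tested against the header text. The following is a hedged variant of the awk commands above, not the course's own method: it joins the sequence lines of each record before matching and prints only the first word of the header. It avoids the gawk-specific `--posix` option and uses the plain pattern `W.H..H`, which matches the same records as the `{1,2}` form. The file names `toy_wrapped.faa` and `toy_ids.txt` and the sequences are made up.

```shell
# Made-up FASTA file; the motif in prot3 is split across two sequence lines.
cat > toy_wrapped.faa <<'EOF'
>prot1 toy protein one
MKTWAHGGH
AAA
>prot2 toy protein two
MKTAAAA
>prot3 toy protein three
MKW
AHCCHH
EOF

# RS='>' makes each FASTA entry one record; FS='\n' makes each line one field,
# so $1 is the header and $2..$NF are the sequence lines.
awk -v RS='>' -v FS='\n' '
NF > 1 {
    seq = ""
    for (i = 2; i <= NF; i++) seq = seq $i      # join wrapped sequence lines
    if (seq ~ /W.H..H/) {
        split($1, header, " ")                  # ID is the first word of the header
        print header[1]
    }
}' toy_wrapped.faa > toy_ids.txt

cat toy_ids.txt
```

On this toy file the record-based search reports `prot1` and `prot3`, while a line-based `grep -E 'W.H..H'` would find only the line from `prot1`.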

Create a BLASTable database with makeblastdb

module load ncbi-blast
makeblastdb -in halobacterium.faa -out halobacterium.faa -dbtype prot -hash_index -parse_seqids

Query BLASTable database by IDs stored in a file (e.g. myIDs)

blastdbcmd -db halobacterium.faa -dbtype prot -entry_batch myIDs -get_dups -out myseq.fasta

Run BLAST search for sequences stored in myseq.fasta

blastp -query myseq.fasta -db halobacterium.faa -outfmt 0 -evalue 1e-6 -out blastp.out
blastp -query myseq.fasta -db halobacterium.faa -outfmt 6 -evalue 1e-6 -out blastp.tab
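
With `-outfmt 6` and no further format specifiers, BLAST+ writes 12 tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. Because the queries here come from the searched database, each query is expected to hit itself. The sketch below shows one way such a table could be filtered with awk. The rows are made up, the file names `toy_blastp.tab` and `toy_nonself.tab` are hypothetical, and the comparison assumes that query and subject IDs are written identically in the output.

```shell
# Made-up rows in the default 12-column -outfmt 6 layout (tab-separated):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\tq1\t100.000\t250\t0\t0\t1\t250\t1\t250\t1e-180\t510\n'  > toy_blastp.tab
printf 'q1\ts7\t45.200\t230\t120\t3\t5\t232\t8\t236\t2e-50\t170\n' >> toy_blastp.tab
printf 'q2\tq2\t100.000\t180\t0\t0\t1\t180\t1\t180\t1e-120\t360\n' >> toy_blastp.tab

# Keep hits where query and subject differ; print query, subject, identity, and e-value.
awk -F '\t' -v OFS='\t' '$1 != $2 { print $1, $2, $3, $11 }' toy_blastp.tab > toy_nonself.tab

cat toy_nonself.tab
```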

Return system time and host name
```
date
hostname
```

Additional exercise material is available in the Linux Manual

Homework assignment

Perform the above analysis on the protein sequences from E. coli. Right-click the link to copy its URL so that it can be used with wget. Record the result from the final BLAST command (the one with -outfmt 6) in a text file.

Homework submission

Submit your homework to GEN242-2021 HW2 on GitHub Classroom by following these stepwise instructions:

- Upload your script and name it hw2.sh.
- Upload the unzipped faa file from step 1 and name it ecoli.faa.
- Upload the IDs from step 5 in a file named myIDs.
- Upload the final file generated with outfmt 6 from step 8 and name it ecoli.txt.
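
Before uploading, it may be worth checking that all four files exist and are not empty. The helper below is only a sketch: the function name `check_submission` is hypothetical, and the file names are taken from the list above. It returns the number of missing files, so a return status of 0 means everything was found.

```shell
# Hypothetical helper: report which of the four expected files are present and non-empty.
# Usage: check_submission [directory]   (defaults to the current directory)
check_submission() {
    dir=${1:-.}
    missing=0
    for f in hw2.sh ecoli.faa myIDs ecoli.txt; do
        if [ -s "$dir/$f" ]; then      # -s is true for an existing, non-empty file
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

Running `check_submission .` inside the repository directory prints one status line per file.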

Due date

Most homework assignments will be due one week after they are assigned. This one is due on Thu, April 8th at 6:00 PM.

Homework solution

See here.

Last modified 2021-04-10: some edits (ebac67cdc)