HW2 - Introduction to Biocluster and Linux

Topic: Linux Basics

  1. Log into your user account on the HPCC cluster, and from there into a compute node with srun.

    srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 1:00:00 --pty bash -l
    
  2. Download code from this page

    wget https://cluster.hpcc.ucr.edu/~tgirke/Linux.sh --no-check-certificate 
    
  3. Download Halobacterium proteome and inspect it

    wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/Halobacterium_salinarum/all_assembly_versions/GCA_004799605.1_ASM479960v1/GCA_004799605.1_ASM479960v1_protein.faa.gz
    gunzip GCA_004799605.1_ASM479960v1_protein.faa.gz
    mv GCA_004799605.1_ASM479960v1_protein.faa halobacterium.faa
    less halobacterium.faa # press q to quit
    
  4. How many protein sequences are stored in the downloaded file?

    grep '>' halobacterium.faa | wc -l
    grep -c '^>' halobacterium.faa
    
  5. How many proteins contain the pattern WxHxxH or WxHxxHH?

    grep -E -c 'W.H..H{1,2}' halobacterium.faa
    
  6. Use less to find IDs for pattern matches or use awk

    awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | less
    awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' halobacterium.faa | grep '^>' | cut -c 2- | cut -f 1 -d ' ' > myIDs
    
  7. Create a BLASTable database with makeblastdb

    module load ncbi-blast/2.2.31+
    makeblastdb -in halobacterium.faa -out halobacterium.faa -dbtype prot -hash_index -parse_seqids
    
  8. Query BLASTable database by IDs stored in a file (e.g. myIDs)

    blastdbcmd -db halobacterium.faa -dbtype prot -entry_batch myIDs -get_dups -out myseq.fasta
    
  9. Run BLAST search for sequences stored in myseq.fasta

    blastp -query myseq.fasta -db halobacterium.faa -outfmt 0 -evalue 1e-6 -out blastp.out
    blastp -query myseq.fasta -db halobacterium.faa -outfmt 6 -evalue 1e-6 -out blastp.tab
    
  10. Return system time and host name

    date
    hostname
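
Steps 4 to 6 can be tried on a small file before running them on the full proteome. The following is a minimal sketch, assuming a made-up three-sequence FASTA file (`toy.faa` and its sequences are hypothetical and not part of the Halobacterium data). Since every WxHxxHH match also contains WxHxxH, the simpler pattern `W.H..H` is used in the awk call, which avoids the need for interval expressions.

```shell
# Toy FASTA file with made-up sequences (hypothetical, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.faa" <<'EOF'
>seq1 hypothetical protein A
MKTWAHGGHLLK
>seq2 hypothetical protein B
MSSPLLKRRA
>seq3 hypothetical protein C
MWCHAAHHTTK
EOF

# Step 4: count sequences by counting header lines
n_seqs=$(grep -c '^>' "$tmpdir/toy.faa")

# Step 5: count lines matching WxHxxH or WxHxxHH
n_hits=$(grep -E -c 'W.H..H{1,2}' "$tmpdir/toy.faa")

# Step 6: split records on '>' and print the first word (the ID) of each match
awk -v RS='>' '/W.H..H/ { print $1 }' "$tmpdir/toy.faa" > "$tmpdir/myIDs"

echo "sequences: $n_seqs, matching lines: $n_hits"
cat "$tmpdir/myIDs"
```

Note that grep counts matching lines, so in a FASTA file with wrapped sequences a pattern that spans a line break is missed and a protein with matches on two lines is counted twice.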
    

Additional exercise material in Linux Manual

Homework assignment

Perform the above analysis on the protein sequences from E. coli. A right click on the link will allow you to copy the URL so that it can be used together with wget. Record the result from the final BLAST command (with outfmt 6) in a text file named myresult.txt.
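
Before submitting, it can help to check that the result file has the expected tabular layout. By default, `-outfmt 6` writes 12 tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The sketch below assumes two made-up rows written to a temporary `myresult.txt`; the IDs and values are hypothetical and not real BLAST output.

```shell
# Made-up rows in the default 12-column -outfmt 6 layout (hypothetical values)
tmpdir=$(mktemp -d)
printf 'seq1\tseq1\t100.000\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n' >  "$tmpdir/myresult.txt"
printf 'seq1\tseq7\t45.500\t110\t58\t2\t5\t112\t9\t118\t2e-20\t85.1\n' >> "$tmpdir/myresult.txt"

# Count rows that do not have exactly 12 tab-separated fields (should be 0)
bad_rows=$(awk -F'\t' 'NF != 12 { n++ } END { print n + 0 }' "$tmpdir/myresult.txt")

# Show query, subject and e-value for hits that are not self-hits
non_self=$(awk -F'\t' '$1 != $2 { print $1, $2, $11 }' "$tmpdir/myresult.txt")

echo "malformed rows: $bad_rows"
echo "$non_self"
```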

Homework submission

Upload the result file (myresult.txt) to your private course GitHub repository under Homework/HW2/HW2.txt.

Due date

Most homeworks will be due one week after they are assigned. This one is due on Thu, April 11th at 6:00 PM.

Homework solution

To be posted.

Last modified 2024-03-23: some edits (1178906e4)