Bioinformatics Exercises

Exercise 1

Download proteome of Halobacterium spec. with wget and look at it:
less AE004437.faa  # press q to quit

2. Simple Analysis:

a) How many predicted proteins are there?

grep '^>' AE004437.faa --count

b) How many proteins contain the pattern “WxHxxH” or “WxHxxHH”?

egrep 'W.H..H{1,2}' AE004437.faa --count

c) Use the find function (/) in ‘less’ to fish out the protein IDs containing the pattern or more elegantly do it with awk:

awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | less # press q to quit
3. Create a BLASTable database with formatdb:
ls # before
formatdb -i AE004437.faa -p T -o T
ls # after '-p F' for nucleotide and '-p T' for protein database; '-o T' parse SeqId and create indexes

4. Generate myseq.fasta

a) Generate list of sequence IDs for the above pattern match result

(i.e. retrieve my_IDs from step 2c). Alternatively, download the pre-generated file with wget.

b) Retrieve the corresponding sequences for these IDs with the fastacmd command from the blastable database:
fastacmd -d AE004437.faa -i my_IDs > myseq.fastaless myseq.fasta # press q to quit

5. Looking at several different patterns:

a) Generate several lists of sequence IDs from various pattern match results (i.e. retriev

a.my_idsb.my_ids, and  c.my_ids from step 2c).

b) Retrieve the sequences in one step using the fastacmd in a for-loop:
for i in *.my_ids; do fastacmd -d AE004437.faa -i $i > $i.fasta; done
6. Run blastall with a few proteins in myseq.fasta against your newly created Halobacterium proteome database.
Create first a complete blast output file  including alignments. In a second step use the ‘m -8’ option to obtain a tabular output (i.e. tab separated values).
blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10
blastall -p blastp -i myseq.fasta -d AE004437.faa -m 8 -e 1e-6 >
less blastp.out # press q to quit
less -S # -S disables line wrapping, press q to quit

The filed descriptions of the Blast tabular output (from the “-m 8” option) are available here.

1  Query (The query sequence id)
2  Subject (The matching subject sequence id)
3  % id
4  alignment length
5  mismatches
6  gap openings
7  q.start
8  q.end
9  s.start
10 s.end
11 e-value
12 bit score
Exercise 2
  1. Split sample fasta batch file with csplit (use sequence file myseq.fasta from Exercise 1).
    csplit -z myseq.fasta '/>/' '{*}'
  2. Delete some of the files generated by csplit
  3. Concatenate single fasta files from (step 1) into to one file with cat (e.g. “cat file1 file2 file3 > bigfile”) .
  4. BLAST two related sequences, retrieve the result in tabular format and use “comm” to identify common hit IDs in the two tables.
Exercise 3
  • Create multiple alignment with ClustalW (e.g. use sequences with ‘W.H..HH’ pattern)
clustalw myseq.fasta
mv myseq.aln myalign.aln
Exercise 4
  • Reformat alignment into PHYILIP format using ‘seqret‘ from EMBOSS
seqret clustal::myalign.aln phylip::myalign.phylip
Exercise 5
  • Create neighbor-joining tree with PHYLIP
cp myalign.phylip infile
protdist # creates distance matrix (you may need to press 'R' and then 'Y')
cp outfile infileneighbor # use default settings (press 'Y')
cp outtree intree
retree # displays tree and can use midpoint method for defining root of tree, my typical command sequence is: 'N' (until you see PHYLIP) 'Y' 'M' 'W' 'R' 'R' 'X'

cp outtree tree.dnd

View your tree in TreeView