Bioinformatics Exercises | Computational Biology Core

Exercise 1

Download proteome of Halobacterium spec. with wget and look at it:

wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.faa

less AE004437.faa  # press q to quit

2. Simple Analysis:

a) How many predicted proteins are there?

grep '^>' AE004437.faa --count

b) How many proteins contain the pattern “WxHxxH” or “WxHxxHH”?

egrep 'W.H..H{1,2}' AE004437.faa --count

c) Use the find function (/) in ‘less’ to fish out the protein IDs containing the pattern or more elegantly do it with awk:

awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | less # press q to quit

3. Create a BLASTable database with formatdb:

ls # before

formatdb -i AE004437.faa -p T -o T

ls # after '-p F' for nucleotide and '-p T' for protein database; '-o T' parse SeqId and create indexes

4. Generate myseq.fasta

a) Generate list of sequence IDs for the above pattern match result

(i.e. retrieve my_IDs from step 2c). Alternatively, download the pre-generated file with wget.

b) Retrieve the corresponding sequences for these IDs with the fastacmd command from the blastable database:

fastacmd -d AE004437.faa -i my_IDs > myseq.fastaless myseq.fasta # press q to quit

5. Looking at several different patterns:

a) Generate several lists of sequence IDs from various pattern match results (i.e. retriev

e a.my_ids, b.my_ids, and c.my_ids from step 2c).

b) Retrieve the sequences in one step using the fastacmd in a for-loop:

for i in *.my_ids; do fastacmd -d AE004437.faa -i $i > $i.fasta; done

6. Run blastall with a few proteins in myseq.fasta against your newly created Halobacterium proteome database.

Create first a complete blast output file including alignments. In a second step use the ‘m -8’ option to obtain a tabular output (i.e. tab separated values).

blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10
blastall -p blastp -i myseq.fasta -d AE004437.faa -m 8 -e 1e-6 > blastp.tab

less blastp.out # press q to quit

less -S blastp.tab # -S disables line wrapping, press q to quit

The filed descriptions of the Blast tabular output (from the “-m 8” option) are available here.

1  Query (The query sequence id)
2  Subject (The matching subject sequence id)
3  % id
4  alignment length
5  mismatches
6  gap openings
7  q.start
8  q.end
9  s.start
10 s.end
11 e-value
12 bit score

Exercise 2
Split sample fasta batch file with csplit (use sequence file myseq.fasta from Exercise 1).
csplit -z myseq.fasta '/>/' '{*}'

Delete some of the files generated by csplit

Concatenate single fasta files from (step 1) into to one file with cat (e.g. “cat file1 file2 file3 > bigfile”) .

BLAST two related sequences, retrieve the result in tabular format and use “comm” to identify common hit IDs in the two tables.

Exercise 3

Create multiple alignment with ClustalW (e.g. use sequences with ‘W.H..HH’ pattern)

clustalw myseq.fasta

mv myseq.aln myalign.aln

Exercise 4

Reformat alignment into PHYILIP format using ‘seqret‘ from EMBOSS

seqret clustal::myalign.aln phylip::myalign.phylip

Exercise 5

Create neighbor-joining tree with PHYLIP

cp myalign.phylip infile

protdist # creates distance matrix (you may need to press 'R' and then 'Y')

cp outfile infileneighbor # use default settings (press 'Y')

cp outtree intree

retree # displays tree and can use midpoint method for defining root of tree, my typical command sequence is: 'N' (until you see PHYLIP) 'Y' 'M' 'W' 'R' 'R' 'X'

cp outtree tree.dnd

View your tree in TreeView