Exercise 1
- Download proteome of Halobacterium spec. with wget and look at it:
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.faa
less AE004437.faa # press q to quit
2. Simple Analysis:
a) How many predicted proteins are there?
grep '^>' AE004437.faa
--count
b) How many proteins contain the pattern “WxHxxH” or “WxHxxHH”?
egrep 'W.H..H{1,2}' AE004437.faa
--count
c) Use the find function (/) in ‘less’ to fish out the protein IDs containing the pattern or more elegantly do it with awk:
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | less
# press q to quit
- 3. Create a BLASTable database with formatdb:
ls
# before
formatdb -i AE004437.faa -p T -o T
ls # after
'-p F' for nucleotide and '-p T' for protein database; '-o T' parse SeqId and create indexes
4. Generate myseq.fasta
- a) Generate list of sequence IDs for the above pattern match result
(i.e. retrieve my_IDs
from step 2c). Alternatively, download the pre-generated file with wget.
- b) Retrieve the corresponding sequences for these IDs with the fastacmd command from the blastable database:
fastacmd -d AE004437.faa -i my_IDs > myseq.fastalessmyseq.fasta
# press q to quit
5. Looking at several different patterns:
- a) Generate several lists of sequence IDs from various pattern match results (i.e. retriev
e a.my_ids
, b.my_ids
, and c.my_ids
from step 2c).
- b) Retrieve the sequences in one step using the fastacmd in a for-loop:
for i in *.my_ids; do fastacmd -d AE004437.faa -i $i > $i.fasta; done
- 6. Run blastall with a few proteins in myseq.fasta against your newly created Halobacterium proteome database.
- Create first a complete blast output file including alignments. In a second step use the ‘m -8’ option to obtain a tabular output (i.e. tab separated values).
blastall -p blastp -i myseq.fasta -d AE004437.faa -o blastp.out -e 1e-6 -v 10 -b 10
blastall -p blastp -i myseq.fasta -d
AE004437.faa -m 8
-e 1e-6
> blastp.tab
lessblastp.out
# press q to quit
less -S
blastp.tab
# -S disables line wrapping, press q to quit
The filed descriptions of the Blast tabular output (from the “-m 8” option) are available here.
1 Query (The query sequence id)
2 Subject (The matching subject sequence id)
3 % id
4 alignment length
5 mismatches
6 gap openings
7 q.start
8 q.end
9 s.start
10 s.end
11 e-value
12 bit score
- Exercise 2
-
- Split sample fasta batch file with csplit (use sequence file myseq.fasta from Exercise 1).
csplit -z myseq.fasta '/>/' '{*}'
- Delete some of the files generated by csplit
- Concatenate single fasta files from (step 1) into to one file with cat (e.g. “cat file1 file2 file3 > bigfile”) .
- BLAST two related sequences, retrieve the result in tabular format and use “comm” to identify common hit IDs in the two tables.
- Split sample fasta batch file with csplit (use sequence file myseq.fasta from Exercise 1).
- Exercise 3
- Create multiple alignment with ClustalW (e.g. use sequences with ‘W.H..HH’ pattern)
clustalw myseq.fasta
mv myseq.aln myalign.aln
- Exercise 4
- Reformat alignment into PHYILIP format using ‘seqret‘ from EMBOSS
seqret clustal::myalign.aln phylip::myalign.phylip
- Exercise 5
- Create neighbor-joining tree with PHYLIP
cp myalign.phylip infile
protdist
# creates distance matrix (you may need to press 'R' and then 'Y')
cp outfile infile
neighbor
# use default settings (press 'Y')
cp outtree intree
retree
# displays tree and can use midpoint method for defining root of tree, my typical command sequence is: 'N' (until you see PHYLIP) 'Y' 'M' 'W' 'R' 'R' 'X'
cp outtree tree.dnd
View your tree in TreeView