Lesson 8: Construction and interpretation of multiple sequence alignments
Automated vs. manual methods - CLUSTAL, T-COFFEE, MUSCLE, MACAW. Alignment formats. Practical construction of a protein sequence alignment.
How to produce a phylogenetic tree: start with a good alignment
Tools for alignment construction
- get yourself a local program - alignment takes a lot of time and computing power
- however, the online servers are worth a look: Expasy (T-Coffee, the newer and better KALIGN, and many more), EBI (MAFFT, MUSCLE) or NCBI (COBALT)
- Generic COBALT output names can be filtered out here
- MUSCLE mirror
- local versions of many programs are available ... e.g. MUSCLE here
Alignment formats
- lots, and not always compatible with tree-drawing programs
- examples: ALN, MSF (PileUp/GCG), Phylip
Format conversion: several common formats via BioEdit.
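Because the formats are easy to confuse, it can help to check what a server actually returned before feeding the file to another program. The following is only a minimal sketch, assuming that the first lines are enough to tell the formats apart; the function name `guess_alignment_format` is made up for this illustration, and real files may contain variants that it does not recognize.

```python
def guess_alignment_format(text):
    """Guess the alignment format from the first non-empty lines.

    Heuristics only (assumptions of this sketch, not a specification):
      - FASTA starts with a '>' header line
      - Clustal (*.aln) starts with a line beginning with 'CLUSTAL'
      - PHYLIP starts with two integers (sequence count, character count)
      - MSF (PileUp/GCG) has a header line containing 'MSF:'
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.upper().startswith("CLUSTAL"):
        return "clustal"
    fields = first.split()
    if len(fields) == 2 and all(field.isdigit() for field in fields):
        return "phylip"
    # The MSF header is usually preceded by a few free-text lines.
    if any("MSF:" in line for line in lines[:20]):
        return "msf"
    return "unknown"


if __name__ == "__main__":
    print(guess_alignment_format(" 11 761\nSpGDI1     ----------"))
```

A result of "unknown" simply means that none of the four heuristics matched; it says nothing about whether the file is a valid alignment.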
Manual or automatic, that is the question.
There is already a science of evaluating alignments ... and a database of reference ones (BAliBASE).
Semi-manual - MACAW:
- locally installed, free, for Mac and PC
- interactive domain definition
- statistical data provided
- may produce false-positive blocks (read the on-line manual!)
- old and not very stable; behaves unreliably under newer operating systems
- proprietary file format, results need editing for use in other programs
Automatic - e.g. CLUSTAL:
- "objective" results
- a number of servers available (although it is better to have your own)
- recommended for well-conserved proteins
- empirical parameters (e.g. gap penalties)
- may not work for divergent sequences (Clustal is notoriously bad at this, newer programs better)
Best of both - use BioEdit to "tune" an alignment produced by automatic tools:
- allows manual editing of a program output or de novo manual alignment
- alignments can be merged
- locally installed, free
- PC only
- somewhat counterintuitive user interface - at least for a spoiled Windows user
Tasks
This time, you can choose from the following data input options:
Take at least 10 sequences of your choice from the combination of
in a manner that would address the following questions:
A. Did the diversification between GDIs and REPs take place once and for all, or repeatedly in different lineages? (Hint: choose both REPs and GDIs representing a complete set for at least one plant and at least one opisthokont.)
B. How old is the diversity of GDIs? (Hint: include at least one REP as an outgroup; two or more will make the alignment easier. Choose complete sets of GDIs for a selection of organisms and make sure you know what your question is.)
The tasks may take more time than usual - let me know if you need more time!
8.1
Use T-Coffee to align your sequences. DO NOT START TOO MANY JOBS AT THE SAME TIME. It may be slow. Then do the same using 1) KALIGN, 2) MAFFT or MUSCLE, and 3) COBALT.
- Compare the results and performance (computation time) of these programs.
- Examine the output file formats.
8.2
Align the sequences of your choice using either Clustal X (local) or Clustal Omega (at the EBI multiple alignment site).
- Start with the default parameters, then try at least one variation of the gap opening and gap extension costs.
- Look at the Quality menu: Show low scoring segments.
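To get a feeling for what the two gap parameters do, it may help to compute the penalty by hand for a single aligned row. The sketch below assumes the simplest affine scheme, where the first position of a gap costs the opening penalty and each further position costs the extension penalty. Conventions differ between programs (some charge the extension for the first position as well, and Clustal additionally adjusts penalties by position and residue), so the function names and the values 10.0 and 0.2 are only illustrative.

```python
import re


def affine_gap_cost(length, gap_open=10.0, gap_extend=0.2):
    """Cost of one gap of the given length under a simple affine scheme.

    Assumption of this sketch: opening covers the first gap position,
    every additional position costs gap_extend.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_cost(aligned_row, gap_open=10.0, gap_extend=0.2):
    """Sum the affine cost over every run of '-' in one aligned sequence."""
    runs = re.findall(r"-+", aligned_row)
    return sum(affine_gap_cost(len(run), gap_open, gap_extend) for run in runs)


if __name__ == "__main__":
    row = "MDEE--YD-VI"
    for opening, extension in [(10.0, 0.2), (5.0, 0.2), (10.0, 2.0)]:
        print(opening, extension, total_gap_cost(row, opening, extension))
```

With a high opening and a low extension cost, one long gap is much cheaper than several short ones, which is why changing the ratio of the two parameters changes where the program prefers to place gaps.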
8.3
Align the same sequences using either MACAW or BioEdit. (If you have chosen option A, you may use your old alignment from Lesson 6, throw out surplus sequences and merge the additional sequences into the alignment.) HINT: try the BLOSUM62 matrix in MACAW.
Keep the resulting file for future use.
Compare the results of all three approaches in BioEdit and choose the best alignment for manual fine-tuning.
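Visual inspection is the main tool for this comparison, but a few simple numbers can help to keep track of several alignments of the same sequences. The following sketch is not a quality score of the kind used in alignment benchmarking; it merely counts columns, and the function name `alignment_summary` and the choice of statistics are assumptions made for illustration.

```python
def alignment_summary(rows):
    """Return simple column statistics for a list of aligned sequences.

    rows: aligned sequences of equal length, with '-' as the gap character.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    columns = list(zip(*rows))
    gap_free = [col for col in columns if "-" not in col]
    identical = [col for col in gap_free if len(set(col)) == 1]
    return {
        "columns": len(columns),                  # total alignment length
        "gap_free_columns": len(gap_free),        # columns usable without gaps
        "identical_columns": len(identical),      # fully conserved columns
        "gap_characters": sum(row.count("-") for row in rows),
    }


if __name__ == "__main__":
    print(alignment_summary(["MDEEYDVI", "M-EEYDAI", "MNEEYDVI"]))
```

A shorter alignment with more gap-free columns is not automatically the better one; the numbers only point to regions worth a closer look.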
Finalizing the alignments
You should end up with at least two independently constructed alignments of the same set of sequences (one from an automated method, the other manually "tuned"), kept in the Fasta, native BioEdit (*.bio) or Macaw (*.mcw) format. Produce a copy of each of the files for future use in three formats:
- in either Fasta or Clustal (*.aln) format.
- in the PHYLIP (interleaved) format (*.phy).
- in the PHYLIP (interleaved) format (*.phy), gaps removed (see below).
Format conversion hints:
In ClustalX or BioEdit, you can choose the option "save as Phylip 4". In Macaw you have to do it manually by one of two methods:
- Export the alignment as text first, then use Word to remove/add extraneous spaces, numbers and names and to add the sequence and character count. (Keep the total of 10+1 positions for sequence name and spaces, though this requirement may not be strict).
- Rename a copy of the *.mcw file to *.txt and edit it manually to Fasta (sequence names from the first part of the file have to be matched to sequences in the second part).
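If you prefer a script to Word, the manual steps above (header with sequence and character count, 10+1 positions for the name, blocks of ten characters) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `to_phylip_interleaved` is made up, the input is assumed to be a list of (name, aligned sequence) pairs you have already extracted, and some programs accept or require slightly different layouts.

```python
def to_phylip_interleaved(records, line_width=50, block=10):
    """Format (name, aligned_sequence) pairs as interleaved PHYLIP text.

    Names are cut or padded to 10 characters and followed by one space
    (the 10+1 positions mentioned above); continuation blocks are indented.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    labels = [name[:10] for name, _ in records]
    if len(set(labels)) != len(labels):
        raise ValueError("names are not unique within the first 10 characters")
    length = lengths.pop()
    out = [f" {len(records)} {length}"]          # sequence and character count
    for start in range(0, length, line_width):
        if start:
            out.append("")                       # blank line between blocks
        for label, (_, seq) in zip(labels, records):
            chunk = seq[start:start + line_width]
            groups = " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))
            prefix = label.ljust(10) if start == 0 else " " * 10
            out.append(f"{prefix} {groups}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    demo = [("SpGDI1", "--MDEEYDVIVLGTGLTECV"), ("ScGDI1", "ETIDTDYDVIVLGTGITECI")]
    print(to_phylip_interleaved(demo, line_width=10))
```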
In the next step, go over your PHYLIP format alignments and remove all positions with gaps (Word is OK for that; of course, you have to recalculate the character count). Alternatively, you can use the "strip columns containing gaps" command in BioEdit.
Keep the gapfree versions as separate files.
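The gap removal itself is a simple column operation: a position is kept only if no sequence has a gap there. The sketch below shows the idea on (name, sequence) pairs; the function name `strip_gap_columns` and the default set of gap characters are assumptions of this example, and it is not meant to reproduce the BioEdit command exactly.

```python
def strip_gap_columns(records, gap_chars="-."):
    """Remove every alignment column in which at least one sequence has a gap.

    records: list of (name, aligned_sequence) pairs of equal length.
    Returns a new list of (name, gap_free_sequence) pairs.
    """
    names = [name for name, _ in records]
    seqs = [seq for _, seq in records]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    kept = [col for col in zip(*seqs) if not any(ch in gap_chars for ch in col)]
    if kept:
        stripped = ["".join(chars) for chars in zip(*kept)]
    else:
        stripped = [""] * len(seqs)              # every column contained a gap
    return list(zip(names, stripped))


if __name__ == "__main__":
    demo = [("HsGDI1", "--MDEEYDVI"), ("ScGDI1", "DQETIDTDYD"), ("HsREP2", "-MADNLPTEF")]
    gap_free = strip_gap_columns(demo)
    # The new character count for the PHYLIP header is the stripped length.
    print(len(gap_free[0][1]), gap_free)
```

The length of the returned sequences is the recalculated character count that has to go into the first line of the gap-free PHYLIP file.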
Example of the Phylip format:
 11 761
SpGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
ScGDI1     ---------- ---------- ---------- -------MDQ ETIDTDYDVI
DmGDI      ---------- ---------- ---------- ---------- --MDEEYDVD
CeGDI1     ---------- ---------- ---------- ---------- --MDEEYDAI
HsGDI2     ---------- ---------- ---------- ---------- --MNEEYDVI
GgGDI      ---------- ---------- ---------- ---------- --MNEEYDVI
HsGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
DmRepP1    ---------- ---------- ---------- --------ML DDLPEQFDLV
HsREP2     ---------- ---------- ---------- --------MA DNLPTEFDVV
CeY67D2    ---------- ---------- ---------- --------MD EKLPESVDVV
ScMRS6     MLSPERRPSM AERRPSFFSF TQNPSPLVVP HLAGIEDPLP ATTPDKVDVL

           VLGTGLTECV LSG-LLSVDG KKVLHIDRND YYGADSASLN -LTQLYALFR
           VLGTGITECI LSG-LLSVDG KKVLHIDKQD HYGGEAASVT -LSQLYEKFK
           VLGTGLKECI LSGIMLSVSG KKVLHIDRNK YYGGESASIT PLEELFQRYR
           VLGTGLKECI ISG-MLSVSG KKVLHIDRNN YYGGESASLT PLEQLYEKFH
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESASIT PLEDLYKRFK
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNS YYGGESASIT PLEDLYKRFN
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESSSIT PLEELYKRFQ
           VIGTGFTESC IAA-AGSRIG KSVLHLDSNE YYGDVWSSFS -MDALCARLD
           IIGTGLPESI LAA-ACSRSG QRVLHIDSRS YYGGNWASFS -FSGLLSWLK
           VLGTGLPEAI LAS-ACARAG LSVLHLDRNE YYGGDWSSFT -MSMVHEVTE
           IAGTGMVESV LAA-ALAWQG SNVLHIDKND YYGDTSATLT -VDQIKRWVN
(alignment truncated)
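To check that a PHYLIP file you produced is internally consistent (the counts in the first line match the data), you can read it back with a short script. This sketch assumes the relaxed interleaved layout used in the example, where names contain no spaces and appear only in the first block; the function name `parse_phylip_interleaved` and the `strict` switch are hypothetical, and files following the strict 10-column name convention with spaces inside names would need a different parser.

```python
def parse_phylip_interleaved(text, strict=True):
    """Read interleaved PHYLIP text into a list of (name, sequence) pairs.

    Assumes whitespace-free names in the first block only. With strict=True,
    the counts from the header line are checked against the data.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    n_seqs, n_chars = (int(field) for field in lines[0].split()[:2])
    if n_seqs <= 0:
        raise ValueError("header must give a positive sequence count")
    names = []
    parts = [[] for _ in range(n_seqs)]
    for i, line in enumerate(lines[1:]):
        fields = line.split()
        if i < n_seqs:                           # first block carries the names
            names.append(fields[0])
            fields = fields[1:]
        parts[i % n_seqs].append("".join(fields))
    records = [(name, "".join(chunks)) for name, chunks in zip(names, parts)]
    if strict:
        if len(records) != n_seqs:
            raise ValueError("fewer sequences than announced in the header")
        for name, seq in records:
            if len(seq) != n_chars:
                raise ValueError(f"{name}: expected {n_chars} characters, found {len(seq)}")
    return records


if __name__ == "__main__":
    demo = " 2 12\nSpGDI1     MDEEYDVIVL\nScGDI1     MDQETIDTDY\n\n           GT\n           DV\n"
    print(parse_phylip_interleaved(demo))
```

A truncated file, such as the example above, fails the strict check; passing strict=False returns whatever could be read.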