Lesson 8: Construction and interpretation of multiple sequence alignments

Automated vs. manual methods - CLUSTAL, T-COFFEE, MUSCLE, MACAW. Alignment formats. Practical construction of a protein sequence alignment.


How to produce a phylogenetic tree: start with a good alignment

Tools for alignment construction


Alignment formats

- there are many, and not all of them are compatible with tree-drawing programs
- examples: ALN, MSF (PileUp/GCG), Phylip

Format conversion: several common formats via BioEdit.


Manual or automatic, that is the question.

There is already a science of evaluating alignments ... and a database of reference ones (BAliBASE).

Semi-manual - MACAW:

Automatic - e.g. CLUSTAL:

Best of both - use BioEdit to "tune" an alignment produced by automatic tools:


Tasks

This time, you can choose from the following data input options:

Take at least 10 sequences of your choice from the combination of

in a manner that would address the following questions:

A. Did the diversification between GDIs and REPs take place "once and for all", or repeatedly in different lineages? (Hint: choose both REPs and GDIs representing a complete set for at least one plant and at least one opisthokont.)

B. How old is the diversity of GDIs? (Hint: include at least one REP as an outgroup; two or more will make the alignment easier. Choose complete sets of GDIs for a selection of organisms, and make sure you know what your question is.)


The tasks may take more time than usual - let me know if you need more time!


8.1

Use T-Coffee to align your sequences. DO NOT START TOO MANY JOBS AT THE SAME TIME. It may be slow. Then do the same using 1) KALIGN, 2) MAFFT or MUSCLE, and 3) COBALT.


8.2

Align the sequences of your choice using either Clustal X (local) or Clustal Omega (at the EBI multiple alignment site). Keep the alignment (*.aln) file.

8.3

Align the same sequences using either MACAW or BioEdit. (If you have chosen option A, you may reuse your old alignment from Lesson 6: throw out surplus sequences and merge the additional sequences into the alignment.)
HINT: try the BLOSUM62 matrix in MACAW.
Keep the resulting file for future use.

Compare the results of all three approaches in BioEdit and choose the best alignment for manual fine-tuning.

Finalizing the alignments


You should end up with at least two independently constructed alignments of the same set of sequences (one from an automated method, the other manually "tuned" and kept in the Fasta, native BioEdit (*.bio) or Macaw (*.mcw) format). Produce a copy of each of the files for future use in three formats:
  1. in either Fasta or Clustal (*.aln) format.
  2. in the PHYLIP (interleaved) format (*.phy).
  3. in the PHYLIP (interleaved) format (*.phy), gaps removed (see below).

Format conversion hints:

In ClustalX or BioEdit, you can choose the option "save as Phylip 4". In Macaw you have to do it manually by one of two methods:

  1. Export the alignment as text first, then use Word to remove extraneous spaces, numbers and names (adding spaces where needed) and to add the sequence count and character count on the first line. (Keep a total of 10+1 positions for the sequence name and spaces, though this requirement may not be strict.)
  2. Rename a copy of the *.mcw file to *.txt and edit it manually to Fasta (sequence names from the first part of the file have to be matched to sequences in the second part).
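
Once an alignment exists as Fasta, the last step to interleaved PHYLIP can also be scripted instead of edited by hand. The following is only a minimal sketch, not the procedure used by ClustalX or BioEdit: the function names `parse_fasta` and `to_phylip_interleaved` are made up for illustration, names are cut or padded to 10 positions plus one space (the "10+1" rule above), and rows are written as blocks of 50 characters in groups of 10, as in the example further below.

```python
def parse_fasta(text):
    """Read aligned Fasta text into a list of (name, sequence) pairs."""
    records, name, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            header = line[1:].split()
            # Use the first word of the header as the sequence name.
            name, parts = (header[0] if header else ""), []
        elif name is not None:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def to_phylip_interleaved(records, block=50, group=10):
    """Format (name, aligned_sequence) pairs as interleaved PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_chars = lengths.pop()
    # First line: number of sequences and number of characters (columns).
    lines = [f"{len(records)}   {n_chars}"]
    for start in range(0, n_chars, block):
        for name, seq in records:
            chunk = seq[start:start + block]
            grouped = " ".join(
                chunk[i:i + group] for i in range(0, len(chunk), group)
            )
            # Names appear only in the first block; later blocks are indented.
            label = name[:10].ljust(10) if start == 0 else " " * 10
            lines.append(f"{label} {grouped}")
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    fasta = ">SpGDI1\n--MDEEYDVIVLGTG\n>HsREP2\nDNLPTEFDVVIIGTG\n"
    print(to_phylip_interleaved(parse_fasta(fasta)))
```

Note that names longer than 10 characters are silently truncated here, so it is worth checking that the shortened names are still unique before using the file in a tree-building program.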

In the next step, go over your PHYLIP format alignments and remove all positions with gaps (Word is OK for that; of course, you have to recalculate the character count). Alternatively, you can use the "strip columns containing gaps" command in BioEdit.

Keep the gapfree versions as separate files.
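
The gap-stripping step can be sketched in a few lines of Python as well. This is an illustration of the idea only (it is not how BioEdit implements its command): the hypothetical function `strip_gap_columns` assumes that gaps are written as "-" and that the alignment is held as a list of (name, sequence) pairs of equal length. It drops every column in which at least one sequence has a gap and returns the new character count, which is the number that has to go into the PHYLIP header.

```python
def strip_gap_columns(records, gap_chars="-"):
    """Drop every alignment column in which at least one sequence has a gap.

    records: list of (name, aligned_sequence) pairs, all of equal length.
    Returns (stripped_records, new_character_count).
    """
    if not records:
        return [], 0
    if len({len(seq) for _, seq in records}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Transpose the alignment so that each tuple is one column.
    columns = zip(*(seq for _, seq in records))
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    # Transpose back: row i collects character i of every kept column.
    stripped = [
        (name, "".join(col[i] for col in kept))
        for i, (name, _) in enumerate(records)
    ]
    return stripped, len(kept)


if __name__ == "__main__":
    toy = [("seq1", "MC-DE"), ("seq2", "L-GDV"), ("seq3", "ICGDA")]
    gapfree, n_chars = strip_gap_columns(toy)
    print(n_chars)
    for name, seq in gapfree:
        print(name, seq)
```

Because a single gapped sequence removes the column for everybody, one poorly aligned or fragmentary sequence can shrink the gap-free alignment considerably; comparing the character count before and after is a quick sanity check.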

Example of the Phylip format:

11   761
SpGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
ScGDI1     ---------- ---------- ---------- -------MDQ ETIDTDYDVI
DmGDI      ---------- ---------- ---------- ---------- --MDEEYDVD
CeGDI1     ---------- ---------- ---------- ---------- --MDEEYDAI
HsGDI2     ---------- ---------- ---------- ---------- --MNEEYDVI
GgGDI      ---------- ---------- ---------- ---------- --MNEEYDVI
HsGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
DmRepP1    ---------- ---------- ---------- --------ML DDLPEQFDLV
HsREP2     ---------- ---------- ---------- --------MA DNLPTEFDVV
CeY67D2    ---------- ---------- ---------- --------MD EKLPESVDVV
ScMRS6     MLSPERRPSM AERRPSFFSF TQNPSPLVVP HLAGIEDPLP ATTPDKVDVL

           VLGTGLTECV LSG-LLSVDG KKVLHIDRND YYGADSASLN -LTQLYALFR
           VLGTGITECI LSG-LLSVDG KKVLHIDKQD HYGGEAASVT -LSQLYEKFK
           VLGTGLKECI LSGIMLSVSG KKVLHIDRNK YYGGESASIT PLEELFQRYR
           VLGTGLKECI ISG-MLSVSG KKVLHIDRNN YYGGESASLT PLEQLYEKFH
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESASIT PLEDLYKRFK
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNS YYGGESASIT PLEDLYKRFN
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESSSIT PLEELYKRFQ
           VIGTGFTESC IAA-AGSRIG KSVLHLDSNE YYGDVWSSFS -MDALCARLD
           IIGTGLPESI LAA-ACSRSG QRVLHIDSRS YYGGNWASFS -FSGLLSWLK
           VLGTGLPEAI LAS-ACARAG LSVLHLDRNE YYGGDWSSFT -MSMVHEVTE
           IAGTGMVESV LAA-ALAWQG SNVLHIDKND YYGDTSATLT -VDQIKRWVN 
    
(alignment truncated)
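
To make the layout of the example explicit, here is a small, hedged Python sketch that reads an interleaved PHYLIP file back and checks it against its own header. The function name `parse_phylip_interleaved` is hypothetical, and the sketch assumes the strict layout shown above: a first line with the sequence count and character count, names within the first 10 positions of the first block, and no names in the following blocks. The truncated example above would fail the length check (its header promises 761 characters), so the check can be switched off.

```python
def parse_phylip_interleaved(text, check_length=True):
    """Parse interleaved PHYLIP text into a list of (name, sequence) pairs."""
    # Blank lines only separate blocks, so they can be dropped.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    header = lines[0].split()
    n_seq, n_chars = int(header[0]), int(header[1])
    body = lines[1:]
    if n_seq <= 0 or len(body) < n_seq or len(body) % n_seq != 0:
        raise ValueError("number of rows does not match the sequence count")
    names, seqs = [], []
    for i, line in enumerate(body):
        if i < n_seq:
            # First block: 10 positions for the name, then the sequence.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks: rows follow the same order as the first block.
            seqs[i % n_seq] += line.replace(" ", "")
    if check_length:
        for name, seq in zip(names, seqs):
            if len(seq) != n_chars:
                raise ValueError(
                    f"{name}: {len(seq)} characters, header says {n_chars}"
                )
    return list(zip(names, seqs))


if __name__ == "__main__":
    example = (
        "2   12\n"
        "SpGDI1     MDEEY DVIVL\n"
        "HsREP2     PTEFD VVIIG\n"
        "\n"
        "           GT\n"
        "           GT\n"
    )
    for name, seq in parse_phylip_interleaved(example):
        print(name, seq)
```

A mismatch between the header and the actual rows is the most common result of editing such files by hand, which is why the sketch raises an error rather than guessing.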