Lesson 8: Construction and interpretation of multiple sequence alignments

Automated vs. manual methods - CLUSTAL, T-COFFEE, MUSCLE, MACAW. Alignment formats. Practical construction of a protein sequence alignment.

How to produce a phylogenetic tree: start with a good alignment

Tools for alignment construction

Get yourself a local program - multiple alignment takes a lot of time and computing power.
However, it is worth looking at the servers at Expasy (T-Coffee, the newer and better KALIGN, and many more), at EBI (MAFFT, MUSCLE) and at NCBI (COBALT).
Generic COBALT output names can be filtered out here
MUSCLE mirror
local versions of many programs are available ... e.g. MUSCLE here

Alignment formats

- lots, and not always compatible with tree-drawing programs
- examples: ALN, MSF (PileUp/GCG), Phylip

Format conversion: several common formats via BioEdit.

Manual or automatic, that is the question.

There is already a science of evaluating alignments ... and a database of reference ones (BAliBASE).

Semi-manual - MACAW:

locally installed, free, for Mac and PC
interactive domain definition
statistical data provided
may produce false-positive blocks (read the on-line manual!)
old, not very stable, behavior under newer OS problematic
proprietary file format, results need editing for use in other programs

Automatic - e.g. CLUSTAL:

"objective" results
a number of servers available (although it is better to have your own)
recommended for well-conserved proteins
empiric parameters (e.g. gap penalties)
may not work for divergent sequences (Clustal is notoriously bad at this, newer programs better)

Best of both - use BioEdit to "tune" an alignment produced by automatic tools:

allows manual editing of a program output or de novo manual alignment
alignments can be merged
locally installed, free
PC only
somewhat counterintuitive user interface - at least for a spoiled Windows user

Tasks

This time, you can choose from the following data input options:

Take at least 10 sequences of your choice from the combination of

input of Task 6.2
sequences obtained in Tasks 2.3 and 4.2.

in a manner that would address the following questions:

A. Did the diversification between GDIs and REPs take place once and for all, or repeatedly in different lineages? (Hint: choose both REPs and GDIs representing a complete set for at least one plant and at least one opisthokont.)

B. How old is the diversity of GDIs? (Hint: include at least one REP as an outgroup; two or more will make the alignment easier. Choose complete sets of GDIs for a selection of organisms and make sure you know what your question is.)

The tasks may take more time than usual - let me know if you need more time!

8.1

Use T-Coffee to align your sequences. DO NOT START TOO MANY JOBS AT THE SAME TIME. It may be slow. Then do the same using 1) KALIGN, 2) MAFFT or MUSCLE, and 3) COBALT.

Compare the results and performance (computation time) of these programs.
Examine the output file formats.

8.2

Align the sequences of your choice using either Clustal X (local) or Clustal Omega (at the EBI multiple alignment site).

Try to start with the default parameters and then try at least one variation of the Gap opening and Gap extension cost.
Look at the Quality menu: Show low scoring segments.
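
To get a feeling for what the Gap opening and Gap extension costs do, the following minimal sketch scores the gaps of a single aligned sequence with a simple affine scheme: every gap pays the opening cost once, plus the extension cost for each gap position. The function names and the parameter values are hypothetical and chosen only for illustration; real programs may define and adjust the penalties in more elaborate ways, so do not expect these numbers to match a program's own scores.

```python
def gap_runs(row, gap="-"):
    """Return the lengths of all runs of consecutive gap characters in one aligned row."""
    runs, current = [], 0
    for ch in row:
        if ch == gap:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # the row ends inside a gap
        runs.append(current)
    return runs


def affine_gap_cost(row, gap_open=10, gap_extend=1):
    """Simplified affine penalty (an assumption, not a specific program's formula):
    each gap costs gap_open once plus gap_extend per gap position."""
    return sum(gap_open + gap_extend * n for n in gap_runs(row))


if __name__ == "__main__":
    row = "AC--GT-A"  # made-up row with one gap of length 2 and one of length 1
    print(gap_runs(row))
    print(affine_gap_cost(row, gap_open=10, gap_extend=1))
```

With a high opening cost and a low extension cost, a few long gaps are cheaper than many short ones, which is the trade-off you are exploring when you vary the two parameters.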

Keep the alignment (*.aln) file.

8.3

Align the same sequences using either MACAW or BioEdit. (If you have chosen option A, you may reuse your old alignment from Lesson 6: throw out the surplus sequences and merge the additional sequences into the alignment.)
HINT: try the BLOSUM62 matrix in MACAW.
Keep the resulting file for future use.

Compare the results of all three approaches in BioEdit and choose the best alignment for manual fine-tuning.
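
If you want a few numbers to accompany the visual comparison, a rough sketch like the one below may help. It takes the aligned rows of one alignment and reports the alignment length, the number of columns containing at least one gap and the number of gap-free columns where all sequences carry the same residue. The function name is hypothetical, and these counts are only crude descriptors assumed here for illustration; they do not replace a proper evaluation against a reference alignment.

```python
def alignment_summary(seqs, gap="-"):
    """Crude descriptive statistics for one alignment given as a list of aligned rows."""
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all rows of an alignment must have the same length")
    columns = list(zip(*seqs))  # transpose rows into columns
    gapped = sum(1 for col in columns if gap in col)
    conserved = sum(1 for col in columns if gap not in col and len(set(col)) == 1)
    return {"length": length, "gapped_columns": gapped, "conserved_columns": conserved}


if __name__ == "__main__":
    # Made-up toy rows, not a real alignment result.
    toy = ["MDEEYDVI", "MNEEYDVI", "M-EEYDAI"]
    print(alignment_summary(toy))
```

Running the same function on the output of each program (for the same set of sequences) gives a quick side-by-side view before you decide which alignment to tune by hand.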

Finalizing the alignments

You should end up with at least two independently constructed alignments of the same set of sequences (one from an automated method, the other manually "tuned" and kept in the Fasta, native BioEdit (*.bio) or MACAW (*.mcw) format). Produce a copy of each of the files for future use in three formats:

in either Fasta or Clustal (*.aln) format.
in the PHYLIP (interleaved) format (*.phy).
in the PHYLIP (interleaved) format (*.phy), gaps removed (see below).

Format conversion hints:

In ClustalX or BioEdit, you can choose the option "save as Phylip 4". In MACAW you have to do it manually by one of two methods:

Export the alignment as text first, then use Word to remove/add extraneous spaces, numbers and names and to add the sequence and character count. (Keep the total of 10+1 positions for sequence name and spaces, though this requirement may not be strict).
Rename a copy of the *.mcw file to *.txt and edit it manually to Fasta (sequence names from the first part of the file have to be matched to sequences in the second part).
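
As an alternative to editing in Word, the manual steps above (padding the names to 10+1 positions, splitting the sequences into blocks and adding the sequence and character count) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, the layout (50 residues per line in groups of 10) simply mimics the example shown in this lesson, and names longer than 10 characters are truncated, which may produce duplicates you would have to fix by hand.

```python
def to_phylip_interleaved(records, block=10, per_line=50, name_width=10):
    """Format (name, aligned_sequence) pairs as interleaved PHYLIP text.

    Assumed layout: a header with sequence and character counts, names padded
    to name_width plus one space, residues in space-separated groups.
    """
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all sequences must have the same aligned length")

    lines = [f"{len(records)}   {length}"]
    for start in range(0, length, per_line):
        if start:
            lines.append("")  # blank line between interleaved blocks
        for name, seq in records:
            chunk = seq[start:start + per_line]
            groups = " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))
            # Names appear only in the first block; later blocks are indented instead.
            label = name[:name_width].ljust(name_width) if start == 0 else " " * name_width
            lines.append(f"{label} {groups}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up short sequences, only to show the layout.
    demo = [("SpGDI1", "--MDEEYDVIVL"), ("ScGDI1", "ETIDTDYDVIVL")]
    print(to_phylip_interleaved(demo, per_line=10))
```

Check the result in a text editor before feeding it to a tree-building program, since different programs are more or less strict about the name field.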

In the next step, go over your PHYLIP format alignments and remove all positions with gaps (Word is OK for that; of course, you have to recalculate the character count). Alternatively, you can use the "strip columns containing gaps" command in BioEdit.

Keep the gapfree versions as separate files.
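
The gap removal step can also be scripted. The following minimal sketch, with a hypothetical function name, drops every column in which at least one sequence has a gap and leaves you with the new character count as the length of any returned sequence. It assumes the alignment is already available as (name, sequence) pairs and that gaps are written as "-" (or "."); reading and writing the files is left out.

```python
def strip_gap_columns(records, gap_chars="-."):
    """Remove every alignment column that contains a gap in at least one sequence.

    records: list of (name, aligned_sequence) pairs of equal length.
    Returns a new list of (name, gap_free_sequence) pairs.
    """
    if not records:
        return []
    seqs = [seq for _, seq in records]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Indices of the columns where no sequence has a gap character.
    keep = [i for i in range(length) if all(s[i] not in gap_chars for s in seqs)]
    return [(name, "".join(seq[i] for i in keep)) for name, seq in records]


if __name__ == "__main__":
    # Made-up toy alignment.
    toy = [("seqA", "M-KVLG"), ("seqB", "MAK-LG")]
    stripped = strip_gap_columns(toy)
    print(stripped)
    print("new character count:", len(stripped[0][1]))
```

Remember that the character count in the first line of the PHYLIP file must be updated to the new length.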

Example of the Phylip format:

11   761
SpGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
ScGDI1     ---------- ---------- ---------- -------MDQ ETIDTDYDVI
DmGDI      ---------- ---------- ---------- ---------- --MDEEYDVD
CeGDI1     ---------- ---------- ---------- ---------- --MDEEYDAI
HsGDI2     ---------- ---------- ---------- ---------- --MNEEYDVI
GgGDI      ---------- ---------- ---------- ---------- --MNEEYDVI
HsGDI1     ---------- ---------- ---------- ---------- --MDEEYDVI
DmRepP1    ---------- ---------- ---------- --------ML DDLPEQFDLV
HsREP2     ---------- ---------- ---------- --------MA DNLPTEFDVV
CeY67D2    ---------- ---------- ---------- --------MD EKLPESVDVV
ScMRS6     MLSPERRPSM AERRPSFFSF TQNPSPLVVP HLAGIEDPLP ATTPDKVDVL

           VLGTGLTECV LSG-LLSVDG KKVLHIDRND YYGADSASLN -LTQLYALFR
           VLGTGITECI LSG-LLSVDG KKVLHIDKQD HYGGEAASVT -LSQLYEKFK
           VLGTGLKECI LSGIMLSVSG KKVLHIDRNK YYGGESASIT PLEELFQRYR
           VLGTGLKECI ISG-MLSVSG KKVLHIDRNN YYGGESASLT PLEQLYEKFH
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESASIT PLEDLYKRFK
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNS YYGGESASIT PLEDLYKRFN
           VLGTGLTECI LSG-IMSVNG KKVLHMDRNP YYGGESSSIT PLEELYKRFQ
           VIGTGFTESC IAA-AGSRIG KSVLHLDSNE YYGDVWSSFS -MDALCARLD
           IIGTGLPESI LAA-ACSRSG QRVLHIDSRS YYGGNWASFS -FSGLLSWLK
           VLGTGLPEAI LAS-ACARAG LSVLHLDRNE YYGGDWSSFT -MSMVHEVTE
           IAGTGMVESV LAA-ALAWQG SNVLHIDKND YYGDTSATLT -VDQIKRWVN

(alignment truncated)
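
To make the layout of the example explicit, here is a rough sketch of how such an interleaved file could be read back into (name, sequence) pairs. It assumes the conventions visible in the example: a header line with the sequence and character counts, names in the first 10 positions of the first block only, and blank lines between blocks. The function name is hypothetical, and the sketch does not check the character count against the header, so it also accepts a truncated alignment like the one shown here (without the "(alignment truncated)" note).

```python
def parse_phylip_interleaved(text, name_width=10):
    """Read interleaved PHYLIP text into (n_seq, n_char, [(name, sequence), ...]).

    Assumes names occupy the first name_width positions of the first block
    and that later blocks contain sequence data only.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]  # drop blank lines
    if not lines:
        raise ValueError("empty input")
    n_seq, n_char = (int(x) for x in lines[0].split()[:2])

    names, seqs = [], []
    for i, ln in enumerate(lines[1:]):
        if i < n_seq:
            # First block: name field followed by the first stretch of sequence.
            names.append(ln[:name_width].strip())
            seqs.append(ln[name_width:].replace(" ", ""))
        else:
            # Later blocks: rows come in the same order as in the first block.
            seqs[i % n_seq] += ln.replace(" ", "")
    return n_seq, n_char, list(zip(names, seqs))


if __name__ == "__main__":
    # Made-up miniature file in the same layout as the example.
    demo = (
        "2   12\n"
        "SpGDI1     --MDEEYDVI\n"
        "ScGDI1     ETIDTDYDVI\n"
        "\n"
        "           VL\n"
        "           VL\n"
    )
    print(parse_phylip_interleaved(demo))
```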

Online tools