Topic 5: Sequence motif searches and protein domain structure analysis I
Extraction of sequence characteristics and searching for known domains - SMART, PROSITE and similar resources.
What can we tell from a sequence (almost) at first glance?
DNA:
- restriction mapping, ORF searches (see here)
- pattern searches (see e.g. SMS)
- finding coding exons, transcription factor binding sites etc. - see e.g. here and some tools at Softberry Inc.
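As a rough illustration of what an ORF search does, the sketch below scans all six reading frames for stretches running from ATG to the first in-frame stop codon. The function names and the `min_codons` cut-off are illustrative assumptions, not the behaviour of any particular web tool; real ORF finders offer alternative start codons and genetic codes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """List (strand, start, end, sequence) for ATG...stop ORFs in six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand that was
    scanned. min_codons counts the start and the stop codon as well.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a new ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, seq[start:i + 3]))
                    start = None                   # close it at the stop
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

ORFs that run off the end of the sequence without a stop codon are deliberately not reported in this sketch.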
Protein:
- size, composition, primary structure characteristics (Expasy, SMS Protein Stats)
- pI (Expasy). BEWARE! NOTE: Expasy is worth a closer look; we will return to it in the next lesson.
- presence of well-known domains
  - NCBI - CDD (see here)
  - PROSITE (available via Expasy)
  - Pfam (deprecated)
  - InterProScan
- presence of localisation signals
- all of the above in a single step ... SMART!
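To make the idea of a PROSITE-style pattern search concrete, here is a minimal sketch that translates a PROSITE pattern into a Python regular expression and reports all (possibly overlapping) matches. The helper names are made up for this example, and the conversion covers only the common syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`); it is not a replacement for ScanProsite, which also uses profiles and post-processing rules.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("<", "^").replace(">", "$")  # terminal anchors
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, protein):
    """Return (1-based position, matched text) for every match, overlaps included."""
    # A lookahead with a capture group lets the regex engine report
    # matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern: N-{P}-[ST]-{P}
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_sites("N-{P}-[ST]-{P}.", "AANVSFNPTANGTG"))
```

Short patterns like this one match very often by chance, which is one reason to treat such hits as candidates rather than evidence.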
Tasks
5.1.
Compute the molecular mass and pI of a selected extensin from Task 2.2 using SMS or the Expasy "Compute pI/mw" tool.
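For orientation, the molecular mass part of this calculation can be sketched in a few lines: sum the average residue masses and add one water molecule for the free termini. The table below uses commonly tabulated average residue masses; the function names are illustrative, and the sketch ignores any post-translational modification, so it only approximates what the web tools report. The pI is left out on purpose, because it depends on which pK set a given tool assumes.

```python
from collections import Counter

# Average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the N-terminal H and C-terminal OH


def composition(protein):
    """Count how often each residue occurs."""
    return Counter(protein.upper())


def molecular_mass(protein):
    """Average molecular mass in Da of an unmodified linear peptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in protein.upper()) + WATER


if __name__ == "__main__":
    toy_peptide = "SPPPPKKHYVYK"  # made-up example, not a database entry
    print(composition(toy_peptide).most_common(3))
    print(round(molecular_mass(toy_peptide), 2))
```

A residue letter outside the twenty standard amino acids raises a `KeyError`, which is a useful reminder to clean the sequence first.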
5.2.
Analyse the sequence below for secretory and transmembrane localisation signals using
- SignalP 6.0 (gives a comparison of two methods with a nice graphical output)
- TMHMM
MAMRLLKTHLLFLHLYLFFSPCFAYTDMEVLLNLKSSMIGPKGHGLHDWIHSSSPDAHCSFSGVSCDDDA
RVISLNVSFTPLFGTISPEIGMLTHLVNLTLAANNFTGELPLEMKSLTSLKVLNISNNGNLTGTFPGEIL
KAMVDLEVLDTYNNNFNGKLPPEMSELKKLKYLSFGGNFFSGEIPESYGDIQSLEYLGLNGAGLSGKSPA
FLSRLKNLREMYIGYYNSYTGGVPREFGGLTKLEILDMASCTLTGEIPTSLSNLKHLHTLFLHINNLTGH
IPPELSGLVSLKSLDLSINQLTGEIPQSFINLGNITLINLFRNNLYGQIPEAIGELPKLEVFEVWENNFT
LQLPANLGRNGNLIKLDVSDNHLTGLIPKDLCRGEKLEMLILSNNFFFGPIPEELGKCKSLTKIRIVKNL
LNGTVPAGLFNLPLVTIIELTDNFFSGELPVTMSGDVLDQIYLSNNWFSGEIPPAIGNFPNLQTLFLDRN
RFRGNIPREIFELKHLSRINTSANNITGGIPDSISRCSTLISVDLSRNRINGEIPKGINNVKNLGTLNIS
GNQLTGSIPTGIGNMTSLTTLDLSFNDLSGRVPLGGQFLVFNETSFAGNTYLCLPHRVSCPTRPGQTSDH
NHTALFSPSRIVITVIAAITGLILISVAIRQMNKKKNQKSLAWKLTAFQKLDFKSEDVLECLKEENIIGK
GGAGIVYRGSMPNNVDVAIKRLVGRGTGRSDHGFTAEIQTLGRIRHRHIVRLLGYVANKDTNLLLYEYMP
NGSLGELLHGSKGGHLQWETRHRVAVEAAKGLCYLHHDCSPLILHRDVKSNNILLDSDFEAHVADFGLAK
FLVDGAASECMSSIAGSYGYIAPEYAYTLKVDEKSDVYSFGVVLLELIAGKKPVGEFGEGVDIVRWVRNT
EEEITQPSDAAIVVAIVDPRLTGYPLTSVIHVFKIAMMCVEEEAAARPTMREVVHMLTNPPKSVANLIAF
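As a quick sanity check alongside the dedicated predictors, a sliding-window hydropathy profile can hint at where hydrophobic segments lie. The sketch below uses the Kyte-Doolittle scale; the window of 19 residues and the threshold of 1.6 are the values commonly quoted for spotting candidate membrane-spanning segments and are assumptions here, as are the function names. This is a much cruder technique than SignalP or TMHMM and is not what those tools do internally.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """Return (1-based window start, mean score) for windows above threshold."""
    return [
        (i + 1, round(mean, 2))
        for i, mean in enumerate(hydropathy_profile(protein, window))
        if mean >= threshold
    ]


if __name__ == "__main__":
    # Toy sequence: a polar stretch, a hydrophobic stretch, a polar stretch.
    toy = "DKDKDKDKDK" + "LIVLLAVILLVIALLVIAL" + "KDKDKDKDKD"
    for start, mean in hydrophobic_windows(toy):
        print(start, mean)
```

Consecutive windows above the threshold belong to the same hydrophobic segment; both an N-terminal signal peptide and a transmembrane helix can show up this way, so the profile alone cannot tell them apart.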
5.3.
Analyse the sequence above using the Expasy ScanProsite search and SMART, and examine the results.
Then take the region of the protein sequence containing LRRs (leucine-rich repeats)
and extract the sequences of at least ten randomly selected LRRs identified by SMART.
Keep them in a "multi-FASTA" file. Then search for repeats using the RADAR tool
at the EBI and keep the results (HTML file) for future use.
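Cutting out the repeats by hand is error-prone, so a small script can help. The sketch below takes a list of (start, end) coordinates, picks a random subset, and writes the corresponding fragments as a multi-FASTA string. The coordinates in the usage example are placeholders on a toy sequence, not SMART output; the function names and header format are likewise assumptions made for illustration.

```python
import random


def pick_random_regions(regions, n=10, seed=None):
    """Randomly choose n regions and return them sorted by start position."""
    chosen = random.Random(seed).sample(regions, n)
    return sorted(chosen)


def regions_to_multifasta(protein, regions, name="protein", width=70):
    """Build a multi-FASTA string from 1-based, inclusive (start, end) pairs."""
    lines = []
    for start, end in regions:
        fragment = protein[start - 1:end]
        lines.append(f">{name}_{start}-{end}")
        # Wrap the fragment to a fixed line width, as in the course sequences.
        for i in range(0, len(fragment), width):
            lines.append(fragment[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_protein = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"          # stand-in sequence
    toy_regions = [(1, 5), (6, 10), (11, 15), (16, 20)]  # placeholder coordinates
    selected = pick_random_regions(toy_regions, n=2, seed=1)
    print(regions_to_multifasta(toy_protein, selected, name="toy"), end="")
```

With real data, the coordinate list would be typed in from the SMART result table and the returned string written to a file with a `.fasta` extension.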
Prediction of secondary structure elements from protein sequence. Methods, pros and cons.
Protein secondary structure prediction
Online tools:
- Network Protein Sequence Analysis @ PBIL-IBCP Lyon-Gerland (consensus prediction from different algorithms, e-mail not required)
- Jpred (consensus prediction from different algorithms, e-mail not required)
- PsiPred
- PredictProtein (several different algorithms, registration required)
Golden rules:
- Avoid the traditional Chou and Fasman algorithm.
- Note the accuracy of the algorithms on standard benchmarks and in "real-life situations".
- Use methods based on multiple alignments. Check the alignment carefully and avoid redundancies.
- Use several independent methods of similar accuracy.
- In case of disagreement, trust PHD (PredictProtein), Jnet (Jpred) and PsiPred.
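When several predictions are at hand, it helps to line them up and see where they agree. The sketch below computes a simple per-position majority vote and a pairwise agreement fraction for prediction strings written in the usual H/E/C alphabet. Majority voting is an assumption made here for illustration only; it is not how the consensus servers listed above combine their methods, and the function names are invented for this example.

```python
from collections import Counter


def consensus(predictions, tie="?"):
    """Per-position majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        # Mark positions where the two most frequent states are tied.
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            result.append(tie)
        else:
            result.append(counts[0][0])
    return "".join(result)


def agreement(a, b):
    """Fraction of positions at which two predictions are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("predictions must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Made-up prediction strings: H = helix, E = strand, C = coil.
    preds = ["HHHHCCEEEC", "HHHCCCEEEC", "HHHHCEEEEC"]
    print(consensus(preds))
    print(agreement(preds[0], preds[1]))
```

Positions marked with the tie symbol, or regions of low pairwise agreement, are the ones worth inspecting by eye.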
Presentation by M. Potocký (from 2004 TIPNET course, addresses may be outdated)
Tasks:
5.4.
>gi|22331122|ref|NP_188302.2| phospholipase D zeta1 / PLDzeta1 (PLDP1) [Arabidopsis thaliana]
MASEQLMSPASGGGRYFQMQPEQFPSMVSSLFSFAPAPTQETNRIFEELPKAVIVSVSRPDAGDISPVLL
SYTIECQYKQFKWQLVKKASQVFYLHFALKKRAFIEEIHEKQEQVKEWLQNLGIGDHPPVVQDEDADEVP
LHQDESAKNRDVPSSAALPVIRPLGRQQSISVRGKHAMQEYLNHFLGNLDIVNSREVCRFLEVSMLSFSP
EYGPKLKEDYIMVKHLPKFSKSDDDSNRCCGCCWFCCCNDNWQKVWGVLKPGFLALLEDPFDAKLLDIIV
FDVLPVSNGNDGVDISLAVELKDHNPLRHAFKVTSGNRSIRIRAKNSAKVKDWVASINDAALRPPEGWCH
PHRFGSYAPPRGLTDDGSQAQWFVDGGAAFAAIAAAIENAKSEIFICGWWVCPELYLRRPFDPHTSSRLD
NLLENKAKQGVQIYILIYKEVALALKINSVYSKRRLLGIHENVRVLRYPDHFSSGVYLWSHHEKLVIVDN
QVCFIGGLDLCFGRYDTFEHKVGDNPSVTWPGKDYYNPRESEPNTWEDALKDELERKKHPRMPWHDVHCA
LWGPPCRDVARHFVQRWNYAKRNKAPYEDSIPLLMPQHHMVIPHYMGRQEESDIESKKEEDSIRGIRRDD
SFSSRSSLQDIPLLLPHEPVDQDGSSGGHKENGTNNRNGPFSFRKSKIEPVDGDTPMRGFVDDRNGLDLP
VAKRGSNAIDSEWWETQDHDYQVGSPDETGQVGPRTSCRCQIIRSVSQWSAGTSQVEESIHSAYRSLIDK
AEHFIYIENQFFISGLSGDDTVKNRVLEALYKRILRAHNEKKIFRVVVVIPLLPGFQGGIDDSGAASVRA
IMHWQYRTIYRGHNSILTNLYNTIGVKAHDYISFYGLRAYGKLSEDGPVATSQVYVHSKIMIVDDRAALI
GSANINDRSLLGSRDSEIGVLIEDTELVDSRMAGKPWKAGKFSSSLRLSLWSEHLGLRTGEIDQIIDPVS
DSTYKEIWMATAKTNTMIYQDVFSCVPNDLIHSRMAFRQSLSYWKEKLGHTTIDLGIAPEKLESYHNGDI
KRSDPMDRLKAIKGHLVSFPLDFMCKEDLRPVFNESEYYASPQVFH
- Predict the secondary structure of one of these domains with at least two different programs and compare the results (results from at least one program, e.g. NPSA, will be accepted for credit).
Prediction of RNA secondary structure
- RNAfold web server
- Moscow State University RNA folding server (allows incorporation of alignment data but performs poorly with a single sequence)
- Classic: mfold (use the 2.3 version that allows temperature settings)
Tasks:
5.5.
Predict the 2D structure of hop viroid RNA:
>gi|13872751|emb|AJ290412.1|HLA290412 Hop latent viroid sequence of 'thermomutant' T229
CTGGGGAATACACTACGTGACTTACCTGTATGATGGCAAGGGTTCGAAGAGGGATCCCCGGGGAAACCTA
CTCGAGCGAGGCGGAGATCGAGCGCCAGTTCGTGCGCGGCGACCTGAAGTTGCTTCGGCTTCTTCTTGTT
CGCGTCCTGCGTGGAACGGCTCCTTCTCCACACCAGCCGGAGTTGGAAACTACCCGGTGGATACAACTCT
TGAGCGCCGAGCTTTACCTGCAGAAGTTCACATAAAAAGTGCCCAT
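To get a feeling for what an RNA folding program computes, the sketch below implements the classic Nussinov base-pair maximisation and prints the result in dot-bracket notation. This is a deliberately simplified stand-in: RNAfold and mfold minimise free energy with experimentally derived parameters, which this sketch does not attempt. The assumptions made here are that G-U wobble pairs are allowed, that a hairpin loop needs at least three unpaired bases (`min_loop`), and that the molecule is treated as linear although viroids are circular.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def dna_to_rna(seq):
    """Database entries give RNA as DNA letters; convert T to U."""
    return seq.upper().replace("T", "U")


def nussinov(rna, min_loop=3):
    """Return one dot-bracket structure with the maximum number of base pairs."""
    n = len(rna)
    best = [[0] * n for _ in range(n)]  # best[i][j]: max pairs within rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case: j stays unpaired
            for k in range(i, j - min_loop):        # case: j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back through the table to recover one optimal set of pairs.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (rna[k], rna[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


if __name__ == "__main__":
    toy = dna_to_rna("GGGAAATCCC")  # short made-up hairpin
    print(toy)
    print(nussinov(toy))
```

The algorithm runs in cubic time, which is still quick for a sequence of viroid length. Several structures usually share the maximum pair count, so the output is one of many equally scoring solutions and should not be expected to match the server predictions.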