Topic 4: Sequence similarity searches II
Special features and implementations of BLAST. A closer look at the pairwise comparison/pairwise alignment problem.
There are quite a few things BLAST can do besides simple searching.
And some things BLAST cannot do ...
- A proline is a proline in any context, and low complexity sequences are (usually) useless ... or why (and why not) are we using the low complexity filter.
- NCBI BLAST searches a pre-compiled domain database by default (CD - conserved domain - search).
- The BLAST algorithm can be used to align two sequences with each other (AND there are other ways to have a quick look at sequence similarity ... intro to various means of making dotplots)
- Position-specific iterated BLAST (PSI-BLAST) is a very sensitive method for detecting diverged sequence motifs (see also here).
- Preset values did not come from Heaven.
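To make the idea behind the low complexity filter more concrete, here is a minimal Python sketch that masks compositionally biased stretches using the Shannon entropy of a sliding window. This is only an illustration of the principle and not the algorithm NCBI BLAST actually applies; the function names, the window size of 12 and the cutoff of 2.2 bits are assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, cutoff: float = 2.2) -> str:
    """Replace every residue that lies in a low-entropy window with 'X'.

    window and cutoff are illustrative values, not BLAST defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < cutoff:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical toy query: a diverse stretch followed by a proline-rich repeat.
    toy = "MGKKACDEFGHIKLSPPPPSPPPPSPPPPSPPPP"
    print(toy)
    print(mask_low_complexity(toy))
```

A masked residue cannot seed or extend a hit, which is why a repetitive query such as an extensin may return very different results with the filter on and off.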
Tasks
4.1
Pick one of the longer and visibly repetitive extensin sequences collected in Task 2.2 and use it to perform the following searches on the non-redundant (nr) protein database at the NCBI BLAST site:
- a standard ("old") BLASTP search with default parameters
- an accelerated BLASTP (Quick-BLASTP) search
- an accelerated BLASTP (Quick-BLASTP) search with the Low Complexity Filter on (accessible in Advanced settings).
Compare the computation time of the "normal" and "quick" BLASTP searches, and compare the results of all three methods.
4.2
Pick one of the Arabidopsis RabGDI homologues found in Task 2.3 and use it to perform a NCBI blastp search on the non-redundant (nr) database, keeping all the parameters default.
- Examine the results of the CD search. (You can also perform a separate CD search from the BLAST homepage; it is recommended to use the "full results" option).
- Look at the BLAST results - first as they come, then use the formatting page to select only Arabidopsis thaliana sequences. Then perform the search again but restrict the database to Arabidopsis thaliana already at the search stage. What is the difference between these results?
- Collect all newly found significantly matching Arabidopsis thaliana reference sequences (RefSeq) into a multi-FASTA file. Add three to five non-reference sequences identified in this search into the file and try to assign them standard locus names (see Task 2.3) using sequence comparison by pairwise BLAST (BLAST2); this will also allow identifying possible allelic variants or sequencing errors. Provide the list of sequences including all the reference proteins identified previously, any new reference proteins identified in this task, and the non-reference sequences assigned to reference loci. Did you find any other class of proteins than RabGDIs?
4.3
Compare the following sequences using the BLAST 2 sequences option on NCBI. Examine the effects of:
- changing the scoring matrix
- changing the gap penalties using a particular scoring matrix
Sequences:
>NP_564369.1 zinc finger (C2H2 type) family protein [Arabidopsis thaliana]
MGKKKKRATEKVWCYYCDREFDDEKILVQHQKAKHFKCHVCHKKLSTASGMVIHVLQVHKENVTKVPNAK
DGRDSTDIEIYGMQGIPPHVLTAHYGEEEDEPPAKVAKVEIPSAPLGGVVPRPYGMVYPPQQVPGAVPAR
PMYYPGPPMRHPAPVWQMPPPRPQQWYPQNPALSVPPAAHLGYRPQPLFPVQNMGMTPTPTSAPAIQPSP
VTGVTPPGIPTSSPAMPVPQPLFPVVNNSIPSQAPPFSAPLPVGGAQQPSHADALGSADAYPPNNSIPGG
TNAHSYASGPNTSGPSIGPPPVIANKAPSNQPNEVYLVWDDEAMSMEERRMSLPKYKVHDETSQMNSINA
AIDRRISESRLAGRMAF
>NP_001080324.1 BUB3-interacting and GLEBS motif-containing protein ZNF207 [Xenopus laevis]
MGRKKKKQLKPWCWYCNRDFDDEKILIQHQKAKHFKCHICHKKLYTGPGLAIHCMQVHKETIDAVPNAIP
GRTDIELEIYGMEGIPEKDMEERRRILEQKTQVDGQKKKTNQDDSDYDDDDDTAPSTSFQQMQTQQAFMP
TMGQPGIPGLPGAPGMPPGITSLMPAVPPLISGIPHVMAGMHPHGMMSMGGMMHPHRPGIPPMMAGLPPG
VPPPGLRPGIPPVTQAQPALSQAVVSRLPVPSTSAPALQSVPKPLFPSAGQAQAHISGPVGTDFKPLNNI
PATTAEHPKPTFPAYTQSTMSTTSTTNSTASKPSTSITSKPATLTTTSATSKLVHPDEDISLEEKRAQLP
KYQRNLPRPGQAPISNMGSTAVGPLGAMMAPRPGLPPQQHGMRHPLPPHGQYGAPLQGMAGYHPGTMPPF
GQGPPMVPPFQGGPPRPLMGIRPPVMSQGGRY
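To get a feeling for how scoring and gap settings shape a pairwise alignment, the following Python sketch computes the best local alignment score by dynamic programming with affine gap penalties. It is a simplified stand-in, not what BLAST does internally: a toy match/mismatch score replaces a real substitution matrix such as BLOSUM62, and the function name and default values are assumptions made for this example. A gap of length k is charged gap_open + k * gap_extend here.

```python
def local_alignment_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                          gap_open: int = 5, gap_extend: int = 1) -> int:
    """Best local alignment score with affine gaps.

    Assumption: toy match/mismatch scoring instead of a substitution matrix.
    """
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    best_any = [[0] * cols for _ in range(rows)]         # best score ending at (i, j)
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]   # alignment ends with a gap in a
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]   # alignment ends with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            gap_in_a[i][j] = max(gap_in_a[i][j - 1] - gap_extend,
                                 best_any[i][j - 1] - gap_open - gap_extend)
            gap_in_b[i][j] = max(gap_in_b[i - 1][j] - gap_extend,
                                 best_any[i - 1][j] - gap_open - gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment start afresh at any cell.
            best_any[i][j] = max(0, best_any[i - 1][j - 1] + pair,
                                 gap_in_a[i][j], gap_in_b[i][j])
            best = max(best, best_any[i][j])
    return best


if __name__ == "__main__":
    # N-terminal lines of the two Task 4.3 sequences.
    ath = "MGKKKKRATEKVWCYYCDREFDDEKILVQHQKAKHFKCHVCHKKLSTASGMVIHVLQVHKENVTKVPNAK"
    xla = "MGRKKKKQLKPWCWYCNRDFDDEKILIQHQKAKHFKCHICHKKLYTGPGLAIHCMQVHKETIDAVPNAIP"
    for gap_open in (1, 5, 11):
        print(gap_open, local_alignment_score(ath, xla, gap_open=gap_open))
```

Cheap gaps let the alignment bridge insertions to collect more matches, while expensive gaps favour shorter ungapped blocks; changing the scoring matrix shifts the balance between matches and mismatches in the same way.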
4.4
Dotplots, such as those seen in the previous lesson, are a good way to get a quick impression of sequence homology, especially in nucleotide sequences.
Here is a DNA sequence that may contain repeats. Download it to your disk and generate a reverse-complemented sequence using SMS Reverse Complement. Save the reverse-complemented sequence in a Fasta file.
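If you prefer to script the reverse-complement step, a minimal Python sketch could look like this. The function names and the FASTA line width of 60 are assumptions for the example, and only A, C, G, T and N are handled; other IUPAC ambiguity codes pass through uncomplemented.

```python
# Translation table for the unambiguous bases plus N (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical toy input; replace with the downloaded sequence.
    toy = "GATTACAGATTACA"
    print(to_fasta("toy_revcomp", reverse_complement(toy)), end="")
```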
Use the EMBOSS Dotmatcher online tool to produce a dotplot (both for the direct and the complementary strand). Examine the effects of altering window size and threshold values (hint: first try default vs. threshold = 50).
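The roles of window size and threshold can be illustrated with a small Python sketch. It is a simplification and not the Dotmatcher implementation: a dot is placed wherever two windows share at least a given number of identical positions, whereas Dotmatcher scores windows with a substitution matrix. The function names and default values are assumptions for this example.

```python
def dotplot(a: str, b: str, window: int = 10, threshold: int = 8) -> set:
    """Return the set of (i, j) window start positions that pass the threshold.

    Assumption: the window score is a plain count of identical positions.
    """
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            identities = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if identities >= threshold:
                dots.add((i, j))
    return dots


def render(dots: set, len_a: int, len_b: int, window: int) -> str:
    """Draw the dotplot as text, one row per window start position in a."""
    rows = []
    for i in range(len_a - window + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len_b - window + 1)))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical toy sequence containing a direct repeat.
    seq = "ACGTTGCAACGTTGCA"
    print(render(dotplot(seq, seq, window=4, threshold=4), len(seq), len(seq), 4))
```

Raising the threshold or widening the window removes random background dots but can also erase diverged repeats; lowering them does the opposite.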
4.5
Use PSI-BLAST to identify eukaryotic homologs of Staphylococcus aureus exfoliative toxin A, using the WP_001065781.1 sequence as a query. Run at least one, preferably two, iterations and document the results with the output of the last iteration performed. What eukaryotic protein family is related to the toxin?