Topic 7: Playing with DNA: gene identification, gene building and PCR primer design
Algorithmic searching for coding sequences and intron/exon structures in chromosomal DNA. Use of EST alignments for confirmation of the splicing pattern. PCR primer design.
How do I find the coding sequences?
- BLAST is a good tool to identify the chromosome region of interest.
- Splice site prediction is NOT universally solved.
- A number of algorithms exist, but there is nothing like a REAL cDNA.
- First and last exons are a problem.
- EST alignment may help.
- A good paper on this topic can be found here.
Splicing rules
- Arabidopsis: most introns have GT-AG boundaries, some are AT-AC (you will need this for the exercises).
- Rules are organism-specific.
- Programs are usually only trained for the majority intron type.
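The boundary rule above can be checked mechanically once you have a candidate intron. The following is a minimal sketch, assuming the intron is given 5' to 3' on the sense strand; the function name `classify_intron` is hypothetical and not part of any of the tools mentioned here.

```python
def classify_intron(intron: str) -> str:
    """Classify an intron by its first and last two nucleotides.

    The intron must be given 5'->3' on the sense strand.
    Returns "GT-AG", "AT-AC" or "other".
    """
    seq = intron.upper().replace("U", "T")  # accept RNA notation as well
    if len(seq) < 4:
        raise ValueError("intron too short to have both boundaries")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "GT-AG"
    if (donor, acceptor) == ("AT", "AC"):
        return "AT-AC"
    return "other"


# Toy sequences, not taken from the exercise data
print(classify_intron("GTAAGTTTTCAG"))
print(classify_intron("ATATCCTTTTAC"))
```

A result of "other" does not prove that the intron is wrong, but it is a good reason to re-check the alignment around that junction.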
How to do it
- GENSCAN: has vertebrate, Arabidopsis and maize options; very nice output incl. CDS; sometimes slow; limited reliability, esp. for first and last exons.
- NetGene2: Arabidopsis, Caenorhabditis and vertebrate statistics; the output format is a bit difficult to read; predicts only introns. This is the only method listed here that reports alternative splicing.
- FGENESH
- GeneMark set of tools, usually GeneMark.hmm
- Last but not least: do an EST alignment whenever possible. Combine BLAST and the good old MACAW (take care to reverse-complement the sequences where necessary; using SMS for manipulating the sequences and testing the ORFs is a good idea).
- All in one: Augustus
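Reverse-complementing and ORF testing are normally done in SMS, but the logic is simple enough to sketch. The code below is an illustration only, not the SMS implementation: it scans all six reading frames for the longest stretch that runs from an ATG to the first in-frame stop codon. The names `reverse_complement` and `longest_orf` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq: str):
    """Find the longest ATG...stop ORF in all six reading frames.

    Returns (strand, start, orf). `start` is the 0-based index on the
    strand that was scanned ("+" = as given, "-" = reverse complement);
    `orf` includes the stop codon. Returns ("+", 0, "") if none is found.
    """
    best = ("+", 0, "")
    plus = seq.upper()
    for strand, s in (("+", plus), ("-", reverse_complement(plus))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the last stop in this frame
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if len(orf) > len(best[2]):
                        best = (strand, start, orf)
                    start = None
    return best


# Toy sequence, not taken from the exercise data
print(longest_orf("CCATGAAATAGCC"))
```

If the longest ORF of an EST is reported on the "-" strand, that is a hint that the sequence should be reverse-complemented before it goes into the alignment.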
PCR primer design
Recommended sites:
- NCBI Primer-BLAST (includes specificity check)
- Primer3
- NetPrimer (registration required, messed-up output on some systems)
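Before pasting candidate primers into one of these sites, a quick sanity check of GC content and approximate melting temperature can be useful. The sketch below uses the simple Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in degrees Celsius. This is an assumption made for illustration: the sites above use more elaborate thermodynamic models, so their values will differ. The function names are hypothetical.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C) by the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); only a coarse estimate for
    short oligonucleotides.
    """
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


# Toy primer, not designed for the exercise gene
primer = "ATGCATGCATGCATGCATGC"
print(gc_percent(primer), wallace_tm(primer))
```

A pair of primers whose rough Tm values differ by many degrees is worth a second look before it is submitted to a design tool.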
Tasks
7.1.
Below you find a fragment of A. thaliana chromosome sequence.
>7_1_genomic ATTACCATAATTTAATTTGAACTTAATTTTCTCTAGGAATGGTGATGATCCACTACCACTATCATTGATT TCATTCCATATTCCTTTGACCGACTGAAATTACGTTGGAAATAGTATATTTTGATGAATAATTTATTTAC TCGGAAAAAAGAGGTCAAGTTATTAATAGTAAGTACATATACATTATCAATTAAGAATTCAATTGAGTTT TAAGGAAAATCCTATTAATTTGTTTGGTATTCGGTATTTGTTAGTTCTAAGGAATTGAATTTCCCGATTA TACATCATTATAACGTTCTCAAGTTCCAAACTTGCAACCCACATTTTGTCGATATTCTCAAATGTGAATT CATTCAATTTCCCATAGAAAACATAAATTTGCACTTAAAGTTAACAATTGAAATCGTATCTAAATGGGAA TGTTTTTGGCTTTTAGTGTTAGACTTCCAAAGCGTCAAAAATATTTCTAGAAAGAGCACAAAAAATAAGC AACGCCACTACTTTTGGACAAAGTCAACGATAACACACATCAACCGCACCAGCTCCATAAAAGTCCATCT CACGAAAACGATTCTAGTCAAACTACCTAAAACACCCTTATATTTACATACAACCCAATCCCACTAACAA GGGTATTTTCGTCAATCACAAAATTTATCACCGACCCGGGAAGAAGAAGAAGAACAGATCAACTAATTTC TGCTTTCAACTCCACATTAAACCAAAACCTCCAAAAAGAATCATTTATTTAAATTATCTTCCCGTTTTAA GTTCCTGAGATTTTTGGGAATTGTAAATTTGAAGAAAATTAAACAAAGACGTGTTTTCATTTTTTTTTTT GTTTCCTTTATTGATCTCTCTCTATCTCTCTAAATGAGCTAAATCGTTAATGGCTGCCATGTTTAATCAT CCATGGCCTAATTTAACCCTAATTTACTTCTTCTTCATCGTCGTTTTACCATTCCAATCACTTTCTCAAT TTGATTCTCCTCAAAATATCGAAACTTTCTTCCCCATCTCTTCACTCTCCCCTGTTCCACCACCGCTTCT TCCACCTTCGTCAAACCCATCTCCGCCGTCGAATAATTCATCATCTTCGGATAAAAAAACAATCACCAAA GCTGTCCTTATAACAGCAGCAAGTACTTTACTTGTAGCTGGAGTTTTCTTCTTCTGCCTCCAAAGATGTA TCATCGCACGGAGACGGAGAGACAGAGTTGGACCAGTCAGAGTCGAAAACACTTTACCTCCGTATCCTCC TCCTCCGATGACGTCGGCGGCGGTGACTACGACTACTTTGGCTAGAGAAGGATTCACGAGGTTTGGTGGT GTGAAAGGTTTGATTCTTGATGAGAATGGTCTTGATGTGTTGTATTGGAGAAAGCTACAGAGTCAGAGAG AAAGAAGTGGGAGTTTCAGGAAACAGATCGTCACCGGAGAAGAAGAAGACGAGAAAGAAGTTATTTATTA CAAGAACAAGAAGAAAACAGAGCCCGTTACAGAGATTCCTCTTCTTAGAGGAAGATCATCTACTTCTCAC AGTGTTATCCATAACGAAGATCATCAGCCGCCACCGCAGGTGAAACAGAGTGAACCAACACCACCACCGC CACCACCGTCAATTGCGGTGAAACAGAGTGCACCAACGCCATCGCCACCTCCTCCGATTAAGAAGGGTTC TTCACCATCGCCACCGCCACCTCCACCGGTGAAAAAGGTTGGAGCTTTATCATCATCAGCTTCGAAACCA CCACCTGCGCCGGTTAGAGGAGCAAGTGGAGGAGAGACTTCGAAACAAGTAAAGTTGAAGCCTTTACATT GGGATAAAGTAAACCCTGATTCCGATCATTCAATGGTTTGGGACAAAATCGATCGTGGATCATTCAGGTA 
TATATTTATTTCGAAAGTTAGGGCTTTTGCTTCAATCAATTGAAAAAACCCTAATTTGTTTTTGTTTCTT CTCAGTTTCGATGGCGATTTAATGGAAGCTCTGTTTGGATACGTTGCCGTGGGGAAGAAATCACCAGAAC AAGGCGATGAGAAAAACCCTAAATCAACGCAAATATTCATACTTGATCCGAGAAAGTCTCAAAACACAGC GATTGTGCTCAAATCATTAGGTATGACACGTGAAGAGCTTGTTGAATCACTCATAGAAGGAAACGATTTC GTGCCAGACACTCTTGAGAGGTTAGCTAGAATAGCTCCAACGAAAGAAGAACAATCAGCCATTCTTGAAT TCGACGGTGACACGGCAAAGCTTGCTGATGCGGAGACGTTTCTGTTTCATCTTCTTAAATCCGTGCCAAC CGCGTTTACGAGACTAAACGCGTTTCTCTTTAGGGCTAATTATTATCCAGAGATGGCTCATCATAGCAAA TGTTTACAAACGTTGGATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAGCTTTTGGAGG CAATACTTAAAGCTGGAAACAGAATGAACGCGGGTACCGCGAGAGGAAACGCTCAAGCGTTTAATCTAAC CGCGCTTTTGAAGCTTTCGGATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAG GAAGTTGTTAGATCGGAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCATAGCTTAACACGAAGCGGTA GTAGTAACTACAATGGTGGTAATAGTAGTCTTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTT GAAGCTTGGTTTACCAGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGCTGCTTGTGTG GACTATGAAACGGTTGTTGCAACTTGTTCTGCTCTTGCGGTTAGAGCGAAAGATGCGAAAACGGTGATTG GAGAATGTGAAGATGGAGAAGGAGGGAGGTTTGTGAAAACGATGATGACGTTTCTTGATTCGGTAGAGGA AGAGGTGAAAATAGCGAAAGGTGAAGAGAGGAAAGTGATGGAGCTTGTGAAACGTACAACGGATTATTAT CAAGCAGGAGCTGTTACAAAGGGGAAGAATCCACTTCATTTGTTTGTTATCGTTAGAGATTTTCTTGCCA TGGTTGATAAAGTTTGCTTAGATATTATGAGAAATATGCAGAGGAGGAAGGTTGGTAGTCCGATATCGCC TTCTTCGCAGCGGAATGCGGTGAAATTCCCGGTTTTGCCTCCGAATTTCATGTCGGACAGAGCTTGGAGT GATTCTGGTGGGTCGGATTCTGATATGTGAGAGTCAAGATTTGTTATATGTAAATACTAAATAGTAGAAG CATTTTGGGTATTGATTAGCATTGAAAGATGTTGAATTGTTTATAGATTTATCAGTCCAAAGCATTGGAC TTGAGTATAATTTGTTCCTTGTATAAATAAACAATTTTGCTTTAAGACCTTTCCATGTTTATGAACATGT CTTCTTTAACTTCACATAGACCTTTTGTTTACGTAAGAACTAATAATACTAAATTGTTTGATAATTCTAA ATGTGAAAGTGAACCACTATATAGTGTGAACTTGGCTTTATTGAATTCTTTTTAAAAAAATTTCTCCAGA GCTTTAGATGTAGGAGTTAATATTTTCACCTAACATAGCCTCTTTTTTATGTTTCTCTATCAACTAACAC TAAATTTGTGGATGAAGACTAAATTAACATAAGTTTATCTATTAACTAACAACCTACCAGTTTGATGCTT GTAAATATGAAACTTCAACGTTATAAAGACTATATGGTGTGAACTTTTTATCCATCTTTATTGACTTTTA AAATTTTCTTAATTTGAGTAAACAAAAGCAGAAGCTTTTTAAAGGATGCAGGAGTTGATTTTTGTATATG 
AACAAAACATATACTTCTCCCTTAGACGAATTTGGAGCTATCATTCTTGGTTTCAAACTTTTTAATAATT TGAGCTTTAAAGCAAAATGGCAACTTTATATTGATCACTAGTCCACAACACTTTCTCTGCCTTTTCCTCA ATAGCAACGCGTAGTCAAGAAGAAGAACGTGTTTAACATGGACCAATCTTGATTAAGATAATAGTATGAT CAAATGCTTATATAAACACACTAAAAAGGAATCAAATTTAA
Use GENSCAN to identify open reading frames within this sequence and keep the predicted CDS (DNA!) as a FASTA file.
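If you copy the predicted CDS out of the server output by hand, it helps to store it in a clean FASTA layout: one header line starting with ">" followed by the sequence in lines of fixed width. A minimal sketch follows; the function name `format_fasta`, the record name and the default width of 70 are illustrative choices.

```python
def format_fasta(name: str, seq: str, width: int = 70) -> str:
    """Return a FASTA record: '>name' header plus wrapped sequence lines."""
    clean = "".join(seq.split()).upper()  # drop spaces and line breaks
    lines = [clean[i:i + width] for i in range(0, len(clean), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy sequence; replace with the CDS copied from the prediction server
record = format_fasta("7_1_genscan_cds", "atggct gccatg\ntttaat")
print(record)
```

The returned string can then be saved to a text file, for example with `open("cds.fasta", "w").write(record)`.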
7.2.
Use at least one (but preferably more) of the other prediction servers demonstrated to obtain an independent splicing prediction for the same sequence. Again, keep the prediction as a (DNA) FASTA file.
HINT: to locate splice sites defined by co-ordinates (such as those produced by NetGene2 or GeneBuilder), use SMS GroupDNA to number your sequence.
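The idea behind the hint can be sketched in a few lines: print the sequence in numbered blocks, then cut out a region by its 1-based, inclusive co-ordinates. This only imitates the general idea of a numbered sequence listing and makes no claim to reproduce the SMS output format; the function names and layout parameters are hypothetical.

```python
def number_sequence(seq: str, group: int = 10, per_line: int = 60) -> str:
    """Format a sequence in space-separated groups, each line prefixed
    with the 1-based position of its first base."""
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = " ".join(chunk[i:i + group] for i in range(0, len(chunk), group))
        lines.append(f"{start + 1:>6} {groups}")
    return "\n".join(lines)


def subsequence(seq: str, start: int, end: int) -> str:
    """Return seq[start..end] using 1-based, inclusive co-ordinates."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("co-ordinates outside the sequence")
    return seq[start - 1:end]


# Toy sequence, not taken from the exercise data
toy = "ACGTACGTAC" * 3
print(number_sequence(toy, group=10, per_line=20))
print(subsequence(toy, 3, 6))
```

Watch for off-by-one errors: prediction servers typically report 1-based positions, whereas Python slices are 0-based and exclude the end index.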
7.3.
Align the two predictions to the genomic sequence using MACAW to compare the results. Then use the experimentally established mRNA sequence below to validate your prediction. Include the best-matching Arabidopsis ESTs from Task 3.3, which should come from the same gene, in your alignment (you may have to reverse-complement some of these sequences first). Produce an experimentally supported ORF prediction and predicted protein sequence.
>mRNA gi_23270371 Arabidopsis thaliana At1g70140 mRNA sequence AACTCCACATTAAACCAAAACCTCCAAAAAGAATCATTTATTTAAATTATCTTCCCGTTTTAAGTTCCTG AGATTTTTGGGAATTGTAAATTTGAAGAAAATTAAACAAAGACGTGTTTTCATTTTTTTTTTTGTTTCCT TTATTGATCTCTCTCTATCTCTCTAAATGAGCTAAATCGTTAATGGCTGCCATGTTTAATCATCCATGGC CTAATTTAACCCTAATTTACTTCTTCTTCATCGTCGTTTTACCATTCCAATCACTTTCTCAATTTGATTC TCCTCAAAATATCGAAACTTTCTTCCCCATCTCTTCACTCTCCCCTGTTCCACCACCGCTTCTTCCACCT TCGTCAAACCCATCTCCGCCGTCGAATAATTCATCATCTTCGGATAAAAAAACAATCACCAAAGCTGTCC TTATAACAGCAGCAAGTACTTTACTTGTAGCTGGAGTTTTCTTCTTCTGCCTCCAAAGATGTATCATCGC ACGGAGACGGAGAGACAGAGTTGGACCAGTCAGAGTCGAAAACACTTTACCTCCGTATCCTCCTCCTCCG ATGACGTCGGCGGCGGTGACTACGACTACTTTGGCTAGAGAAGGATTCACGAGGTTTGGTGGTGTGAAAG GTTTGATTCTTGATGAGAATGGTCTTGATGTGTTGTATTGGAGAAAGCTACAGAGTCAGAGAGAAAGAAG TGGGAGTTTCAGGAAACAGATCGTCACCGGAGAAGAAGAAGACGAGAAAGAAGTTATTTATTACAAGAAC AAGAAGAAAACAGAGCCCGTTACAGAGATTCCTCTTCTTAGAGGAAGATCATCTACTTCTCACAGTGTTA TCCATAACGAAGATCATCAGCCGCCACCGCAGGTGAAACAGAGTGAACCAACACCACCACCGCCACCACC GTCAATTGCGGTGAAACAGAGTGCACCAACGCCATCGCCACCTCCTCCGATTAAGAAGGGTTCTTCACCA TCGCCACCGCCACCTCCACCGGTGAAAAAGGTTGGAGCTTTATCATCATCAGCTTCGAAACCACCACCTG CGCCGGTTAGAGGAGCAAGTGGAGGAGAGACTTCGAAACAAGTAAAGTTGAAGCCTTTACATTGGGATAA AGTAAACCCTGATTCCGATCATTCAATGGTTTGGGACAAAATCGATCGTGGATCATTCAGTTTCGATGGC GATTTAATGGAAGCTCTGTTTGGATACGTTGCCGTGGGGAAGAAATCACCAGAACAAGGCGATGAGAAAA ACCCTAAATCAACGCAAATATTCATACTTGATCCGAGAAAGTCTCAAAACACAGCGATTGTGCTCAAATC ATTAGGTATGACACGTGAAGAGCTTGTTGAATCACTCATAGAAGGAAACGATTTCGTGCCAGACACTCTT GAGAGGTTAGCTAGAATAGCTCCAACGAAAGAAGAACAATCAGCCATTCTTGAATTCGACGGTGACACGG CAAAGCTTGCTGATGCGGAGACGTTTCTGTTTCATCTTCTTAAATCCGTGCCAACCGCGTTTACGAGACT AAACGCGTTTCTCTTTAGGGCTAATTATTATCCAGAGATGGCTCATCATAGCAAATGTTTACAAACGTTG GATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAGCTTTTGGAGGCAATACTTAAAGCTG GAAACAGAATGAACGCGGGTACCGCGAGAGGAAACGCTCAAGCGTTTAATCTAACCGCGCTTTTGAAGCT TTCGGATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAGGAAGTTGTTAGATCG GAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCATAGCTTAACACGAAGCGGTAGTAGTAACTACAATG 
GTGGTAATAGTAGTCTTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTTGAAGCTTGGTTTACC AGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGCTGCTTGTGTGGACTATGAAACGGTT GTTGCAACTTGTTCTGCTCTTGCGGTTAGAGCGAAAGATGCGAAAACGGTGATTGGAGAATGTGAAGATG GAGAAGGAGGGAGGTTTGTGAAAACGATGATGACGTTTCTTGATTCGGTAGAGGAAGAGGTGAAAATAGC GAAAAAAAAAAAAAAAA
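Once the ORF is settled, the predicted protein sequence is a straightforward translation with the standard genetic code. The sketch below builds the codon table from the usual TCAG ordering and translates up to the first stop codon. It assumes the ORF starts at position 1 in frame; the function name `translate` is hypothetical, and SMS or any other translation tool gives the same result more conveniently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate an in-frame DNA ORF, stopping at the first stop codon.

    Codons containing characters other than A, C, G, T become 'X'.
    """
    seq = orf.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy ORF, not the answer to the exercise
print(translate("ATGGCTTTTTAA"))
```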
7.4.
Design PCR primers for detection of a 400 to 700 bp long diagnostic fragment of the cDNA assembled in Task 7.3 using NCBI Primer-BLAST or Primer3. Try to select primers that would distinguish between a product amplified from the cDNA and one amplified from contaminating genomic DNA in an RT-PCR experiment (hint: the MACAW alignment comes in handy for this). Present a graphical map of the locus, showing the positions of the primers and the cDNA exons mapped to the genomic locus.
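One way to tell cDNA from genomic contamination is to place the two primers in different exons, so that the genomic product is longer by the length of the intron(s) in between. The sketch below computes both product sizes from exon co-ordinates. It assumes that exons are given as 1-based, inclusive genomic (start, end) pairs in transcript order and that each primer lies entirely within one exon. All names and the example co-ordinates are hypothetical, not the solution for this locus.

```python
def cdna_to_genomic(pos: int, exons) -> int:
    """Map a 1-based cDNA position to its 1-based genomic position.

    `exons` is a list of (start, end) genomic co-ordinates, 1-based and
    inclusive, in transcript order on the forward strand.
    """
    if pos < 1:
        raise ValueError("cDNA positions are 1-based")
    offset = 0  # number of cDNA bases covered by the preceding exons
    for start, end in exons:
        length = end - start + 1
        if pos <= offset + length:
            return start + (pos - offset - 1)
        offset += length
    raise ValueError("position lies beyond the end of the cDNA")


def product_sizes(fwd_start: int, rev_end: int, exons):
    """Return (cDNA product size, genomic product size) in bp.

    fwd_start: cDNA position of the 5' end of the forward primer.
    rev_end:   cDNA position of the last base of the amplicon
               (the 5' end of the reverse primer).
    """
    cdna = rev_end - fwd_start + 1
    genomic = cdna_to_genomic(rev_end, exons) - cdna_to_genomic(fwd_start, exons) + 1
    return cdna, genomic


# Hypothetical locus: two exons separated by a 100 bp intron
exons = [(101, 200), (301, 450)]
print(product_sizes(51, 150, exons))
```

If both sizes are equal, the amplicon lies within a single exon and the pair cannot distinguish cDNA from genomic DNA by product length.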