Topic 7: Playing with DNA: gene identification, gene building and PCR primer design

Algorithmic searching for coding sequences and intron/exon structures in chromosomal DNA. Use of EST alignments for confirmation of the splicing pattern. PCR primer design.


How do I find the coding sequences?


Splicing rules


How to do it


PCR primer design

Recommended sites:


Tasks

7.1.

Below is a fragment of an A. thaliana chromosome sequence.

>7_1_genomic
ATTACCATAATTTAATTTGAACTTAATTTTCTCTAGGAATGGTGATGATCCACTACCACTATCATTGATT
TCATTCCATATTCCTTTGACCGACTGAAATTACGTTGGAAATAGTATATTTTGATGAATAATTTATTTAC
TCGGAAAAAAGAGGTCAAGTTATTAATAGTAAGTACATATACATTATCAATTAAGAATTCAATTGAGTTT
TAAGGAAAATCCTATTAATTTGTTTGGTATTCGGTATTTGTTAGTTCTAAGGAATTGAATTTCCCGATTA
TACATCATTATAACGTTCTCAAGTTCCAAACTTGCAACCCACATTTTGTCGATATTCTCAAATGTGAATT
CATTCAATTTCCCATAGAAAACATAAATTTGCACTTAAAGTTAACAATTGAAATCGTATCTAAATGGGAA
TGTTTTTGGCTTTTAGTGTTAGACTTCCAAAGCGTCAAAAATATTTCTAGAAAGAGCACAAAAAATAAGC
AACGCCACTACTTTTGGACAAAGTCAACGATAACACACATCAACCGCACCAGCTCCATAAAAGTCCATCT
CACGAAAACGATTCTAGTCAAACTACCTAAAACACCCTTATATTTACATACAACCCAATCCCACTAACAA
GGGTATTTTCGTCAATCACAAAATTTATCACCGACCCGGGAAGAAGAAGAAGAACAGATCAACTAATTTC
TGCTTTCAACTCCACATTAAACCAAAACCTCCAAAAAGAATCATTTATTTAAATTATCTTCCCGTTTTAA
GTTCCTGAGATTTTTGGGAATTGTAAATTTGAAGAAAATTAAACAAAGACGTGTTTTCATTTTTTTTTTT
GTTTCCTTTATTGATCTCTCTCTATCTCTCTAAATGAGCTAAATCGTTAATGGCTGCCATGTTTAATCAT
CCATGGCCTAATTTAACCCTAATTTACTTCTTCTTCATCGTCGTTTTACCATTCCAATCACTTTCTCAAT
TTGATTCTCCTCAAAATATCGAAACTTTCTTCCCCATCTCTTCACTCTCCCCTGTTCCACCACCGCTTCT
TCCACCTTCGTCAAACCCATCTCCGCCGTCGAATAATTCATCATCTTCGGATAAAAAAACAATCACCAAA
GCTGTCCTTATAACAGCAGCAAGTACTTTACTTGTAGCTGGAGTTTTCTTCTTCTGCCTCCAAAGATGTA
TCATCGCACGGAGACGGAGAGACAGAGTTGGACCAGTCAGAGTCGAAAACACTTTACCTCCGTATCCTCC
TCCTCCGATGACGTCGGCGGCGGTGACTACGACTACTTTGGCTAGAGAAGGATTCACGAGGTTTGGTGGT
GTGAAAGGTTTGATTCTTGATGAGAATGGTCTTGATGTGTTGTATTGGAGAAAGCTACAGAGTCAGAGAG
AAAGAAGTGGGAGTTTCAGGAAACAGATCGTCACCGGAGAAGAAGAAGACGAGAAAGAAGTTATTTATTA
CAAGAACAAGAAGAAAACAGAGCCCGTTACAGAGATTCCTCTTCTTAGAGGAAGATCATCTACTTCTCAC
AGTGTTATCCATAACGAAGATCATCAGCCGCCACCGCAGGTGAAACAGAGTGAACCAACACCACCACCGC
CACCACCGTCAATTGCGGTGAAACAGAGTGCACCAACGCCATCGCCACCTCCTCCGATTAAGAAGGGTTC
TTCACCATCGCCACCGCCACCTCCACCGGTGAAAAAGGTTGGAGCTTTATCATCATCAGCTTCGAAACCA
CCACCTGCGCCGGTTAGAGGAGCAAGTGGAGGAGAGACTTCGAAACAAGTAAAGTTGAAGCCTTTACATT
GGGATAAAGTAAACCCTGATTCCGATCATTCAATGGTTTGGGACAAAATCGATCGTGGATCATTCAGGTA
TATATTTATTTCGAAAGTTAGGGCTTTTGCTTCAATCAATTGAAAAAACCCTAATTTGTTTTTGTTTCTT
CTCAGTTTCGATGGCGATTTAATGGAAGCTCTGTTTGGATACGTTGCCGTGGGGAAGAAATCACCAGAAC
AAGGCGATGAGAAAAACCCTAAATCAACGCAAATATTCATACTTGATCCGAGAAAGTCTCAAAACACAGC
GATTGTGCTCAAATCATTAGGTATGACACGTGAAGAGCTTGTTGAATCACTCATAGAAGGAAACGATTTC
GTGCCAGACACTCTTGAGAGGTTAGCTAGAATAGCTCCAACGAAAGAAGAACAATCAGCCATTCTTGAAT
TCGACGGTGACACGGCAAAGCTTGCTGATGCGGAGACGTTTCTGTTTCATCTTCTTAAATCCGTGCCAAC
CGCGTTTACGAGACTAAACGCGTTTCTCTTTAGGGCTAATTATTATCCAGAGATGGCTCATCATAGCAAA
TGTTTACAAACGTTGGATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAGCTTTTGGAGG
CAATACTTAAAGCTGGAAACAGAATGAACGCGGGTACCGCGAGAGGAAACGCTCAAGCGTTTAATCTAAC
CGCGCTTTTGAAGCTTTCGGATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAG
GAAGTTGTTAGATCGGAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCATAGCTTAACACGAAGCGGTA
GTAGTAACTACAATGGTGGTAATAGTAGTCTTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTT
GAAGCTTGGTTTACCAGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGCTGCTTGTGTG
GACTATGAAACGGTTGTTGCAACTTGTTCTGCTCTTGCGGTTAGAGCGAAAGATGCGAAAACGGTGATTG
GAGAATGTGAAGATGGAGAAGGAGGGAGGTTTGTGAAAACGATGATGACGTTTCTTGATTCGGTAGAGGA
AGAGGTGAAAATAGCGAAAGGTGAAGAGAGGAAAGTGATGGAGCTTGTGAAACGTACAACGGATTATTAT
CAAGCAGGAGCTGTTACAAAGGGGAAGAATCCACTTCATTTGTTTGTTATCGTTAGAGATTTTCTTGCCA
TGGTTGATAAAGTTTGCTTAGATATTATGAGAAATATGCAGAGGAGGAAGGTTGGTAGTCCGATATCGCC
TTCTTCGCAGCGGAATGCGGTGAAATTCCCGGTTTTGCCTCCGAATTTCATGTCGGACAGAGCTTGGAGT
GATTCTGGTGGGTCGGATTCTGATATGTGAGAGTCAAGATTTGTTATATGTAAATACTAAATAGTAGAAG
CATTTTGGGTATTGATTAGCATTGAAAGATGTTGAATTGTTTATAGATTTATCAGTCCAAAGCATTGGAC
TTGAGTATAATTTGTTCCTTGTATAAATAAACAATTTTGCTTTAAGACCTTTCCATGTTTATGAACATGT
CTTCTTTAACTTCACATAGACCTTTTGTTTACGTAAGAACTAATAATACTAAATTGTTTGATAATTCTAA
ATGTGAAAGTGAACCACTATATAGTGTGAACTTGGCTTTATTGAATTCTTTTTAAAAAAATTTCTCCAGA
GCTTTAGATGTAGGAGTTAATATTTTCACCTAACATAGCCTCTTTTTTATGTTTCTCTATCAACTAACAC
TAAATTTGTGGATGAAGACTAAATTAACATAAGTTTATCTATTAACTAACAACCTACCAGTTTGATGCTT
GTAAATATGAAACTTCAACGTTATAAAGACTATATGGTGTGAACTTTTTATCCATCTTTATTGACTTTTA
AAATTTTCTTAATTTGAGTAAACAAAAGCAGAAGCTTTTTAAAGGATGCAGGAGTTGATTTTTGTATATG
AACAAAACATATACTTCTCCCTTAGACGAATTTGGAGCTATCATTCTTGGTTTCAAACTTTTTAATAATT
TGAGCTTTAAAGCAAAATGGCAACTTTATATTGATCACTAGTCCACAACACTTTCTCTGCCTTTTCCTCA
ATAGCAACGCGTAGTCAAGAAGAAGAACGTGTTTAACATGGACCAATCTTGATTAAGATAATAGTATGAT
CAAATGCTTATATAAACACACTAAAAAGGAATCAAATTTAA

Use Augustus, NetGene2 and at least one of the other listed prediction servers to identify open reading frames on the forward strand of this sequence. Save the predicted mRNA (if available) and the predicted CDS (as DNA, not protein) in a multi-sequence FASTA file.


7.2.

For the methods that provide only exon coordinates (GeneMark, NetGene2), obtain the predicted CDS (DNA) sequence using tools from SMS (Range Extractor to remove introns, ORF Finder to identify coding sequences). Add these CDS predictions to your FASTA file from 7.1.
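What the two SMS steps do can be sketched in a few lines of Python. This is only an illustration under stated assumptions: exon coordinates are taken to be 1-based and inclusive, the ORF search is a naive ATG-to-stop scan of the forward strand, and the function names (extract_ranges, longest_orf) and the toy sequence are made up for this example. They are not part of SMS and not the task data.

```python
def extract_ranges(genomic, ranges):
    """Join 1-based, inclusive (start, end) ranges, i.e. remove the introns."""
    return "".join(genomic[start - 1:end] for start, end in ranges)


def longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand (stop included)."""
    stops = {"TAA", "TAG", "TGA"}
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Toy sequence (hypothetical): exon 1, a GT...AG intron, exon 2.
toy_genomic = "CCATGGCT" + "GTAAGTTTTCAG" + "GATTGACC"
toy_exons = [(1, 8), (21, 28)]

spliced = extract_ranges(toy_genomic, toy_exons)
print(spliced)
print(longest_orf(spliced))
```

With real predictions, replace toy_exons with the exon coordinates reported by the server and check that the resulting ORF agrees with the one found by ORF Finder.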

7.3.

Align your mRNA and CDS predictions to the genomic sequence using Macaw and compare the results. Then use the experimentally established mRNA sequence below to validate your predictions. Include the best-matching Arabidopsis ESTs from Task 3.3, which should come from the same gene, in your alignment (you may have to reverse-complement some of these sequences first). Produce an experimentally supported ORF prediction and the predicted protein sequence.
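If you prefer to reverse-complement ESTs yourself rather than with an online tool, the following sketch shows one way to do it. It also includes a crude orientation check that counts exact k-mers shared with the genomic sequence. The helper names (reverse_complement, shared_kmers, orient_to) and the default k are assumptions made for this example, sequences are assumed to be in the same letter case, and the check is no substitute for inspecting the alignment.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shared_kmers(query, target, k):
    """Count the k-mers of query that occur exactly somewhere in target."""
    target_kmers = {target[i:i + k] for i in range(len(target) - k + 1)}
    return sum(query[i:i + k] in target_kmers
               for i in range(len(query) - k + 1))


def orient_to(query, target, k=12):
    """Return query in whichever orientation shares more k-mers with target."""
    flipped = reverse_complement(query)
    if shared_kmers(query, target, k) >= shared_kmers(flipped, target, k):
        return query
    return flipped


# Toy data (hypothetical): an "EST" read from the opposite strand.
toy_genomic = "CCATGGCTGTAAGTTTTCAGGATTGACC"
toy_est = reverse_complement(toy_genomic[2:20])

print(reverse_complement("ATGGCTGATTGA"))
print(orient_to(toy_est, toy_genomic, k=6))
```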

>mRNA gi_23270371 Arabidopsis thaliana At1g70140 mRNA sequence
AACTCCACATTAAACCAAAACCTCCAAAAAGAATCATTTATTTAAATTATCTTCCCGTTTTAAGTTCCTG
AGATTTTTGGGAATTGTAAATTTGAAGAAAATTAAACAAAGACGTGTTTTCATTTTTTTTTTTGTTTCCT
TTATTGATCTCTCTCTATCTCTCTAAATGAGCTAAATCGTTAATGGCTGCCATGTTTAATCATCCATGGC
CTAATTTAACCCTAATTTACTTCTTCTTCATCGTCGTTTTACCATTCCAATCACTTTCTCAATTTGATTC
TCCTCAAAATATCGAAACTTTCTTCCCCATCTCTTCACTCTCCCCTGTTCCACCACCGCTTCTTCCACCT
TCGTCAAACCCATCTCCGCCGTCGAATAATTCATCATCTTCGGATAAAAAAACAATCACCAAAGCTGTCC
TTATAACAGCAGCAAGTACTTTACTTGTAGCTGGAGTTTTCTTCTTCTGCCTCCAAAGATGTATCATCGC
ACGGAGACGGAGAGACAGAGTTGGACCAGTCAGAGTCGAAAACACTTTACCTCCGTATCCTCCTCCTCCG
ATGACGTCGGCGGCGGTGACTACGACTACTTTGGCTAGAGAAGGATTCACGAGGTTTGGTGGTGTGAAAG
GTTTGATTCTTGATGAGAATGGTCTTGATGTGTTGTATTGGAGAAAGCTACAGAGTCAGAGAGAAAGAAG
TGGGAGTTTCAGGAAACAGATCGTCACCGGAGAAGAAGAAGACGAGAAAGAAGTTATTTATTACAAGAAC
AAGAAGAAAACAGAGCCCGTTACAGAGATTCCTCTTCTTAGAGGAAGATCATCTACTTCTCACAGTGTTA
TCCATAACGAAGATCATCAGCCGCCACCGCAGGTGAAACAGAGTGAACCAACACCACCACCGCCACCACC
GTCAATTGCGGTGAAACAGAGTGCACCAACGCCATCGCCACCTCCTCCGATTAAGAAGGGTTCTTCACCA
TCGCCACCGCCACCTCCACCGGTGAAAAAGGTTGGAGCTTTATCATCATCAGCTTCGAAACCACCACCTG
CGCCGGTTAGAGGAGCAAGTGGAGGAGAGACTTCGAAACAAGTAAAGTTGAAGCCTTTACATTGGGATAA
AGTAAACCCTGATTCCGATCATTCAATGGTTTGGGACAAAATCGATCGTGGATCATTCAGTTTCGATGGC
GATTTAATGGAAGCTCTGTTTGGATACGTTGCCGTGGGGAAGAAATCACCAGAACAAGGCGATGAGAAAA
ACCCTAAATCAACGCAAATATTCATACTTGATCCGAGAAAGTCTCAAAACACAGCGATTGTGCTCAAATC
ATTAGGTATGACACGTGAAGAGCTTGTTGAATCACTCATAGAAGGAAACGATTTCGTGCCAGACACTCTT
GAGAGGTTAGCTAGAATAGCTCCAACGAAAGAAGAACAATCAGCCATTCTTGAATTCGACGGTGACACGG
CAAAGCTTGCTGATGCGGAGACGTTTCTGTTTCATCTTCTTAAATCCGTGCCAACCGCGTTTACGAGACT
AAACGCGTTTCTCTTTAGGGCTAATTATTATCCAGAGATGGCTCATCATAGCAAATGTTTACAAACGTTG
GATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAGCTTTTGGAGGCAATACTTAAAGCTG
GAAACAGAATGAACGCGGGTACCGCGAGAGGAAACGCTCAAGCGTTTAATCTAACCGCGCTTTTGAAGCT
TTCGGATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAGGAAGTTGTTAGATCG
GAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCATAGCTTAACACGAAGCGGTAGTAGTAACTACAATG
GTGGTAATAGTAGTCTTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTTGAAGCTTGGTTTACC
AGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGCTGCTTGTGTGGACTATGAAACGGTT
GTTGCAACTTGTTCTGCTCTTGCGGTTAGAGCGAAAGATGCGAAAACGGTGATTGGAGAATGTGAAGATG
GAGAAGGAGGGAGGTTTGTGAAAACGATGATGACGTTTCTTGATTCGGTAGAGGAAGAGGTGAAAATAGC
GAAAAAAAAAAAAAAAA 

 


7.4.

Design PCR primers for detection of a 400 to 700 bp long diagnostic fragment of the cDNA assembled in Task 7.3 using NCBI Primer-BLAST or Primer3. Try to select primers that would distinguish between a product amplified from the cDNA and one amplified from contaminating genomic DNA in an RT-PCR experiment (hint: the Macaw alignment comes in handy for this). Present a graphical map of the locus, showing the positions of the primers and the cDNA exons mapped to the genomic locus.
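Two common ways to tell cDNA from genomic products are primers that flank an intron (the genomic product is longer) and a primer that spans an exon-exon junction (it has no exact binding site in genomic DNA). The sketch below checks a candidate pair by exact string matching against both templates. It is a simplified illustration: the function names and toy sequences are hypothetical, the melting temperature uses the rough Wallace rule (2 °C per A/T, 4 °C per G/C), which is meant for short oligonucleotides only, and Primer3 and Primer-BLAST compute melting temperatures with more detailed thermodynamic models.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    return sum(base in "GC" for base in primer) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C (Wallace rule, short oligos)."""
    at = sum(base in "AT" for base in primer)
    gc = sum(base in "GC" for base in primer)
    return 2 * at + 4 * gc


def amplicon_size(template, forward, reverse):
    """Product length for exact primer matches, or None if a primer has no site.

    forward is written 5'->3' on the template strand; reverse is written
    5'->3' on the opposite strand, as primers are normally reported.
    """
    start = template.find(forward)
    if start < 0:
        return None
    site = reverse_complement(reverse)
    end = template.find(site, start)
    if end < 0:
        return None
    return end + len(site) - start


# Toy templates (hypothetical, far shorter than real primers and amplicons).
toy_genomic = "CCATGGCT" + "GTAAGTTTTCAG" + "GATTGACC"
toy_cdna = "CCATGGCT" + "GATTGACC"

flanking_fwd, rev = "CCATGG", "GGTCAA"   # pair flanking the intron
junction_fwd = "GGCTGATT"                # primer spanning the exon junction

print(amplicon_size(toy_cdna, flanking_fwd, rev),
      amplicon_size(toy_genomic, flanking_fwd, rev))
print(amplicon_size(toy_cdna, junction_fwd, rev),
      amplicon_size(toy_genomic, junction_fwd, rev))
print(wallace_tm(flanking_fwd), round(gc_content(flanking_fwd), 2))
```

For the actual task, use the cDNA from Task 7.3 and the genomic fragment from 7.1 as templates, and keep in mind that real primers tolerate mismatches that exact string matching does not model.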