Topic 1: Basic handling of sequence data

Introduction


What do we have, and what don't we have?


Sequence data:

types (protein, DNA, RNA), IUPAC codes, conventions.
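The IUPAC nucleotide codes can be written down as a small lookup table. The sketch below is only an illustration: the helper name `iupac_to_regex` is hypothetical, and the table covers DNA codes only (no protein or RNA-specific symbols).

```python
import re

# IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Turn an IUPAC pattern into a regular expression string (hypothetical helper)."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]
        # Unambiguous bases stay as they are; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

# Example: a pattern with one fully ambiguous position.
site = iupac_to_regex("GANTC")
hit = re.search(site, "TTGAGTCAA")
print(site, hit.start() if hit else None)
```

Note that this matches an ambiguous pattern against an unambiguous sequence; a raw read that itself contains N would need the comparison to work in both directions.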

Other data:

genomic maps, 3D structure files, alignments, patterns, microarray hybridization data ... we'll come to them.

Sequence file formats:

>sequence_name anything after the first space is a comment
thisisasequenceinfastaformat
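To make the layout concrete, here is a minimal sketch of reading FASTA text in Python. It assumes a plain-text input in which header lines start with ">" and everything after the first whitespace on a header line is a comment; the function name `parse_fasta` is an illustrative choice, not part of any course tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, comment, sequence) tuples."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence lines are joined; line breaks carry no meaning.
            chunks.append(line)
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records

example = ">sequence_name free-text comment\nthisisasequence\ninfastaformat\n"
for rec_name, rec_comment, rec_seq in parse_fasta(example):
    print(rec_name, "|", rec_comment, "|", rec_seq)
```

Established libraries (for example Biopython) provide more robust parsers; the sketch only shows how little structure the format has.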

Converting sequences from one format to another:


Orientation in a sequence:


In silico cloning:


Tasks

In every lesson, some (partial) tasks are described in color. These are "control points" required for getting course credits in the Faculty of Sciences course. At least four "colored" tasks, each of a different color, have to be presented.


1.1

You have just sequenced an Arabidopsis cDNA fragment cloned in the EcoRI site of the pBluescriptII SK- vector and obtained the following raw sequence:

CCGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAAT
TCGGTGACCCCGGCAAAGCTTGCTTAATCCGAAGACGTTTCTGTTTCATC
TTCTTAAATCCGGGCCAACNGCGTTTACGAGACTAAACGCGTTTCTCTTT
AGGGCTTAATTATTATCCAGAGATGGCTCATCATAGCAAATGTTTACAAA
CGTTGGATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAG
CTTTTGGAGGCAATACTTAAAGCTGGAAACAGAATGAACGCGGGTACCGC
GAGAGGAAACGCTCAAGCGTTTAATCTAACCGCGCTTTTGAAGCTTTCGG
ATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAG
GAAGTTGTTAGATCGGAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCA
TAGCTTAACACGAAGCGGTAGTAGTAACTACAATGGTGGTAATAGTAGTC
TTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTTGAAGCTTGGT
TTACCAGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGC
TGCTTGTGTGGACTATGAAACGGTTGTTGCAACTTGTTCTGCTCTTGCGG
TTAGAGCGAAAGATGCGAAAACGGTGATTGGAGAATGTGAAGATGGAGAA
GGAGGGAGGTTTGTGAAAACGATGATGACGTTTCNTGATTCGGTAGAGGA
AGAGGTGAAAATAGCGAAAGGTGAAGAGAGGAAAGTGATCCCTGA

 

  1. Copy the sequence into a text file and convert it to the FASTA format.
  2. Perform restriction analysis to identify sites from the vector polylinker (recommended program: SMS Restriction Digest). Check the vector-insert boundary using the map provided below, then remove the vector sequences so that only clean cDNA is left. Save the clean cDNA sequence (in FASTA format) as a text file.
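The same two steps can also be sketched in a few lines of Python, which may help to check the result obtained with the web tools. This is only an illustration: the function names `to_fasta` and `find_sites` are hypothetical, the line width of 60 is an arbitrary choice, and only the first two lines of the raw read are used as input.

```python
import re

def to_fasta(name, raw, width=60):
    """Strip everything that is not a letter and wrap the sequence as FASTA."""
    seq = re.sub(r"[^A-Za-z]", "", raw).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

def find_sites(seq, site):
    """Return 1-based start positions of every occurrence of a recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(site, start + 1)  # allow overlapping hits
    return positions

# First two lines of the raw read from task 1.1 (line breaks are removed below).
raw_start = """CCGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAAT
TCGGTGACCCCGGCAAAGCTTGCTTAATCCGAAGACGTTTCTGTTTCATC"""

read = re.sub(r"\s", "", raw_start)
print(to_fasta("raw_read_start", raw_start), end="")
print("EcoRI (GAATTC) sites at:", find_sites(read, "GAATTC"))
```

A site can span a line break in the raw text, so the sequence has to be joined into one string before searching. Trimming the vector still requires checking the reported positions against the polylinker map.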


1.2

You have just sequenced the following Arabidopsis cDNA fragment. Note the format of the sequence: a header line (>mysequence) followed by the sequence itself. This is the FASTA format.

>cDNA_2_At
CAAGCAACCTTTACACATATAGAAGAAGAAAAACACTTCTTTGTTTCTGTCATTAATTCCCTCCCTCTAT
ATATATATATTTAAATCTATTATGACAAAACAATCCAATACTGGATACTTTTTACAACAACATGCACAGA
CAAAGCTGAGATCTCACCTTAAGAAACTAATGAGATTGAGTTATGTTCGTCTTCATCAGACGAAGAGGTT
GAAGACGACGACGAAGAAGAAGAAGATTGTCTTCGTCCAACGAGTCCAGGAAGAGGTTGTGGCATCATTG
GATTCACTGGAACAGGAAACTTATGAGCAGAACTAACCATTGTTCTTTCGTTTATCATCCCTACTTCTTT
GCAAACTCTGTCTACTACTCCAAGGAAGTCTCTAACCACCAAGAATATTCTAAACGGATGCGCTTCTTCT
TTAGCCGAGTTTCCATGGAAATACTCTGTGATTTCTTTTACAAGTGATAACGCTACGCTCTCTTGAGCTT
GTACTCTGATGATCTCTTCCTCAGCTCTTTTCAGAAACNTTTTCATCGATTCCGNNAACCTCTGACTGTT
GCTTTCTTCTGTGATTGTTGATTGGACTTGGATTGCTTCGTTGATCTTGGCAATGNNTTGAGGAAAGCTT
GGAGANGTAGCTGCTTAGTACTTCTGAGTCCATCNNAGC

  1. Using SMS Reverse Complement, convert the sequence to antisense (reverse complement) and save it as a FASTA file.
  2. Look at the coding capacity of your cDNA in both directions (using SMS ORF Finder). Select the longest reading frame and save its protein translation as a FASTA file. (NOTE: you might not get a reasonable ORF. Assume that this was a raw first-run sequence containing possible frameshifts and other errors.)
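For orientation, both operations can be sketched in Python. The sketch makes several assumptions that may differ from the SMS defaults: it uses the standard genetic code, translates any codon containing N as X, and defines an ORF as a stretch that starts with ATG and runs to the next stop codon or to the end of the sequence. The function names are illustrative.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate frame 1 of seq; codons with ambiguous bases become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def longest_orf(seq):
    """Return (strand, frame, protein) for the longest ATG-initiated ORF."""
    best = ("", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            # Stop codons ("*") split the translation into open stretches.
            for segment in translate(s[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[2]):
                    best = (strand, frame + 1, segment[start:])
    return best

demo = "CCATGGCCAAATAGCC"
print(reverse_complement(demo))
print(longest_orf(demo))
```

With a raw read, a frameshift will break the true ORF into pieces in different frames, so the longest stretch reported this way is a starting point rather than a final answer.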

1.3

The following DNA sequence was assembled from two files, each in a different format.

   1 TTTAATAAAA TAAAAATCCA CTCGCATTTT TATTTTCAAC ATTGTGCGTA    50
  51 CGGTGCAATT CAATGAACAG TGTTTACTTT CAGTGTGTAC ACTTCTGCGG   100
 101 ACTATTACAA AGTCCACGTC TTATCCTACG TGTTATAATC TCATATGTTA   150
 151 CTGTCTGAAA TGGACCCCAC TACGTAAAAA TAAAATTAAG AATCAACCAC   200
 201 TCTTCTTCCA TCACCTCTTT TGGCTTTCTC TCTACTCTCT CTACTACTCT   250
 251 CTCACCATCA CTGAGTTAAG AGAACAAACC AAAAACAAAA TTATCAAACC   300
 301 ATCACCAGCA GAATCTTAGC TGGATTCATC ACTCTATTCA AAAAGTTTCT   350
 351 CTCTTCTCTT TTCTCAGATC TTGAACTCTT GAAGAAGAAA GAAGAAGATA   400
 401 ACACAATGCT CTTCTTCTTA TTCTTCTTCT ACTTACTCTT ATCTTCATCC   450
 451 TCCGATCTAG TCTTCGCCGA CCGTCGTGTA CTCCACGAAC CATTCTTCCC   500
 501 TATAGATTCA CCACCACCGT CACCACCATC ACCACCACCA CTTCCTAAAC   550
 551 TACCATTCTC TTCAACCACT CCTCCATCTT CATCAGACCC AAATGCTTCT   600
 601 CCTTTCTTCC CTTTATACCC TTCATCTCCA CCACCACCTT CTCCAGCCTC   650
 651 CTTCGCTTCT TTTCCGGCGA ATATCTCATC TCTAATCGTC CCTCACGCCA   700
 701 CTAAATCCCC ACCTAACTCC AAAAAACTCC TTATCGTCGC TATCTCCGCC   750
 751 GTTTCCTCCG CTGCTTTAGT CGCTCTACTT ATCGCTTTAC TCTATTGGCG   800
 801 AAGAAGCAAA CGTAACCAAG ATCTTAACTT CTCCGATGAT AGCAAAACAT   850
 851 ACACCACCGA CAGTAGCCGC CGTGTCTACC CTCCTCCTCC GGCAACGGCG   900
 901 CCTCCAACAC GACGCAATGC GGAGGCTAGA AGTAAACAGA GGACCACCAC   950
 951 GAGCTCCACC AATAACAACA GCTCTGAGTT TCTTTACTTA GGAACAATGG  1000
1001 TGAATCAAAG AGGAATCGAT GAACAATCTC TTAGTAATAA TGGATCAAGC  1050
1051 TCAAGAAAAC TTGAATCTCC AGATCTTCAA CCACTTCCTC CATTGATGAA  1100
1101 ACGAAGTTTC CGTTTAAATC CAGATGTTGG TTCAATCGGA GAAGAAGATG  1150
1151 AAGAAGATGA GTTTTACTCT CCACGTGGCT CACAAAGCGG GCGAGAACCG  1200
1201 TTAAACCGGG TCGGACTTCC GGGTCAAAAT CCTAGATCTG TTAACAATGA  1250
1251 CACTATCTCT TGCTCATCTT CAAGCTCTGG TTCACCAGGA AGATCAACAT  1300
1301 TTATCAGTAT CTCTCCTTCA ATGAGTCCTA AGAGATCTGA ACCAAAACCG  1350
1351 CCGGTTATCT CCACACCAGA ACCGGCGGAG TTAACCGATT ATAGATTTGT  1400
1401 TCGGTCTCCG TCACTGTCGT TAGCTTCTTT ATCGTCGGGA TTGAAAAACT  1450
1451 CCGATGAAGT AGGATTGAAT CAAATCTTTA GATCTCCGAC GGTTACATCT  1500
1501 CTAACAACTT CACCGGAGAA TAACAAAAAA GAGAACTCTC CATTATCATC  1550
1551 TACTTCAACT TCACCGGAAC GACGACCAAA TGATACACCA GAAGCTTACT  1600
TGAGATCTCCGTCGCATTCTTCTGCTTCTACATCACCGTATAGATGTTTT
CAGAAATCTCCGGAGGTCTTACCGGCGTTTATGAGTAATCTCCGGCAAGG
TTTGCAATCTCAGTTACTATCTTCTCCTTCTAACTCTCATGGAGGACAAG
GTTTCCTTAAGCAGTTAGATGCATTACGTTCTCGTTCACCGTCGTCGTCT
TCTTCTTCTGTTTGTTCTTCACCGGAGAAAGCTTCTCATAAGTCACCAGT
TACATCTCCTAAGTTATCTTCCCGGAATTCGCAGTCTCTATCATCTTCTC
CGGATAGAGATTTTAGTCATAGCTTAGATGTATCACCACGGATATCGAAC
ATTTCACCTCAAATTTTACAGTCTCGTGTGCCTCCGCCTCCTCCTCCTCC
CCCACCGTTGCCGTTGTGGGGACGACGGAGTCAGGTGACTACTAAAGCGG
ACACAATCTCGAGACCGCCTTCTCTTACACCGCCTTCACATCCTTTTGTG
ATCCCATCTGAAAACTTACCAGTGACTTCGTCTCCTATGGAGACTCCAGA
GACGGTTTGTGCGAGTGAGGCGGCGGAGGAAACTCCGAAACCGAAGCTAA
AGGCGTTACATTGGGATAAAGTTAGAGCAAGTTCGGATCGTGAGATGGTT
TGGGATCATCTTCGATCAAGCTCTTTCAAGTGAGTTAATGTGACATACTC
GTTTATATGATACTATATGCTTTTAGTGAGAATGTGGTTGTTGAGATTAT
GAATGTGGTTTGCAGATTAGATGAGGAGATGATTGAGACGTTGTTTGTGG
CGAAGTCGTTAAACAACAAACCAAATCAGAGTCAGACAACTCCAAGATGT
GTTCTCCCGAGCCCGAACCAAGAGAACAGAGTCCTGGACCCGAAGAAGGC
TCAGAATATTGCCATCTTGCTTCGTGCACTTAATGTCACTATAGAAGAAG
TTTGTGAGGCTCTTCTTGAAGGTAAACTATGCTGTCACATACATAGTTTC
TCATTTTCTTCTCCTTTGATCTCCAGAATTAGAGTTCTTATGCATTTGTT
AATGGTTTTTCGATGATATGGTTGAGTTATTCTGAAAGCTTTGCTTCTTT
GATGGTGTGGAGATTCTTGGTTACATTGATGTTCTTAGTTATGCTTTTTC
AGGCAATGCTGATACACTGGGGACTGAACTTCTTGAGAGCTTACTGAAGA
TGGCACCGACAAAAGAAGAAGAGCGCAAGTTGAAAGCGTACAATGATGAT
TCGCCTGTTAAGCTTGGACATGCTGAGAAATTCCTTAAGGCAATGTTGGA
CATCCCTTTCGCCTTTAAAAGAGTTGATGCAATGCTCTATGTAGCCAACT
TTGAGTCCGAGGTTGAATACTTGAAGAAATCTTTTGAGACTCTTGAGGTA
TATATTACAAGCTATTCTCTCTCTTTTTACCATATGGTTGTATTGTAACA
GATTATGACTTCATTTCTATTGTTTGTGTAGGCTGCTTGTGAAGAACTGA
GGAACAGTAGGATGTTCTTAAAGCTTCTTGAAGCGGTTCTAAAGACAGGA
AACCGTATGAACGTTGGAACAAACCGAGGAGATGCACATGCGTTCAAGCT
TGATACACTTCTCAAGCTAGTCGATGTCAAAGGCGCTGATGGGAAAACAA
CTCTCTTGCATTTCGTTGTACAAGAGATAATCCGAGCAGAAGGCACACGT
CTCTCAGGTAACAATACACAAACAGATGACATTAAATGCCGGAAACTAGG
TCTCCAAGTTGTATCAAGTCTCTGTTCTGAGCTTAGTAACGTCAAGAAAG
CTGCTGCGATGGACTCAGAAGTACTAAGCAGCTACGTCTCCAAGCTTTCT
CAAGGCATTGCCAAGATCAACGAAGCAATCCAAGTCCAATCAACAATCAC
AGAAGAAAGCAACAGTCAGAGGTTTTCGGAATCGATGAAAACGTTTCTGA
AAAGAGCTGAGGAAGAGATCATCAGAGTACAAGCTCAAGAGAGCGTAGCG
TTATCACTTGTAAAAGAAATCACAGAGTATTTCCATGGAAACTCGGCTAA
AGAAGAAGCGCATCCGTTTAGAATATTCTTGGTGGTTAGAGACTTCCTTG
GAGTAGTAGACAGAGTTTGCAAAGAAGTAGGGATGATAAACGAAAGAACA
ATGGTTAGTTCTGCTCATAAGTTTCCTGTTCCAGTGAATCCAATGATGCC
ACAACCTCTTCCTGGACTCGTTGGACGAAGACAATCTTCTTCTTCTTCGT
CGTCGTCTTCAACCTCTTCGTCTGATGAAGACGAACATAACTCAATCTCA
TTAGTTTCTTAAGGTGAGATCTCAGCTTTGTCTGTGCATGTTGTTGTAAA
AAGTATCCAGTATTGGATTGTTTTGTCATAATAGATTTAAATATATATAT
ATAGAGGGAGGGAATTAATGACAGAAACAAAGAAGTGTTTTTCTTTTCTG
CATTTGTGTAAAAAAAATAATATAGGTTTACCTTAAAATTTGTTCATCTT
AAATTAATAATTTAAGAATCAAATAAATTTGTTTATCTGAACCGTGTGTA
CCACGAAAGAATGTGAGAGCAAACATATTACTTACTTACCCTTCGTTGCT
GAATATAATGATCATTATAAATCACTACCTCCAGTACCTTCTACCTTCTT
CAAAGAACCTTGTTGGATTTGAACCAAAGTTGGAACATAATTGACGAGAG
GTGAGCATCTAGATTCTGCATCGTGATGATGATCCACTTTTATCTATTTA

  1. Convert the sequence to FASTA (recommended: SMS Filter DNA).
  2. Construct its restriction map using SMS Restriction Map, WebCutter, or Molbiotools Restriction Analyzer.
  3. A "virtual cloning" exercise: Take the EcoRI-XhoI fragment and clone it into pBluescript SK- (the sequence and map can be found here in the PlasmaDNA format). Create a map of the product using PlasmaDNA. Save any FASTA intermediates as *.fasta.
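The first and third steps can be sketched in Python as a cross-check on the web tools. This is a simplified illustration under stated assumptions: `filter_dna` and `excise` are hypothetical helper names, the input contains no header line, only the top strand is handled, and each enzyme is assumed to cut where its recognition site first occurs (EcoRI cuts G^AATTC, XhoI cuts C^TCGAG).

```python
import re

# Recognition site and cut offset on the top strand (bases kept to the left of the cut).
SITES = {"EcoRI": ("GAATTC", 1), "XhoI": ("CTCGAG", 1)}

def filter_dna(text):
    """Drop position numbers, spaces and line breaks; keep only A, C, G, T, N."""
    return re.sub(r"[^ACGTNacgtn]", "", text).upper()

def excise(seq, left, right):
    """Return the top-strand fragment from the first `left` cut to the next `right` cut."""
    left_site, left_offset = SITES[left]
    right_site, right_offset = SITES[right]
    i = seq.find(left_site)
    if i == -1:
        return None  # left enzyme does not cut
    j = seq.find(right_site, i + len(left_site))
    if j == -1:
        return None  # no right site downstream of the left one
    return seq[i + left_offset:j + right_offset]

numbered = "   1 TTTAATAAAA TAAAAATCCA CTCGCATTTT TATTTTCAAC ATTGTGCGTA    50\n"
print(filter_dna(numbered))

toy = "TTGAATTCAAAACTCGAGTT"  # made-up sequence, not the task sequence
print(excise(toy, "EcoRI", "XhoI"))
```

If either enzyme cuts more than once, the first hit is not necessarily the fragment you want, so the full restriction map from step 2 should be consulted before trusting this shortcut.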

1.4

Using SAVVY, construct a map of a plasmid, pTEST, of the following properties:

Hints for saving the output: