Topic 1: Basic handling of sequence data
Introduction
- use of course workstations - data handling (use either your folder on the server or your own USB stick).
What do we have, and what don't we have?
- a computer connected to the Internet (incl. Web browser)
- a text editor (Notepad or better)
- public databases of genomic sequences, cDNA + EST, protein sequences, structures and motifs, gene expression data etc.
- money for specialised software packages
- public servers capable of (almost) anything we wish to do
- Identification of relevant sequence portions, file formats, reformatting utilities.
- Restriction site analysis and translation of DNA sequences.
- Construction of graphical maps, in silico cloning.
Sequence data:
types (protein, DNA, RNA), IUPAC codes, conventions.
Other data:
genomic maps, 3D structure files, alignments, patterns, microarray hybridization data ... we'll come to them.
Sequence file formats:
- Make sure you have the correct format.
- FASTA format is (almost) always correct.
>sequence_name anything after the first space is a comment
thisisasequenceinfastaformat
- If not, you can (almost) always use raw data.
thisisasequenceinrawformat
- There are more formats, such as EMBL and GenBank, as well as those generated by automatic sequencers, but you don't need to know how to write these by hand!
- If things do not work, check for spaces in the sequence, empty lines, and the file extension (some programs are sensitive).
- Beware of Microsoft! Word processors can add hidden formatting, so always save sequences as plain text.
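The format rules above can be illustrated with a minimal Python sketch. It is not part of the course tools; the function names (`clean_sequence`, `to_fasta`, `read_fasta`) and the 60-character line width are assumptions chosen for illustration.

```python
def clean_sequence(raw):
    """Keep only letters: drops spaces, digits, line breaks and similar debris."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()


def to_fasta(name, raw, width=60):
    """Build a FASTA record: a '>' header line, then the sequence in lines of `width`."""
    seq = clean_sequence(raw)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def read_fasta(text):
    """Parse FASTA text into {name: sequence}; the name ends at the first space."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate empty lines instead of failing on them
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += clean_sequence(line)
    return records


if __name__ == "__main__":
    print(to_fasta("my_seq", "acgt acgt\n 11 ACGT"))
```

Everything after the first space in the header is treated as a comment and ignored by `read_fasta`, which mirrors the convention described above.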
Converting sequences from one format to another:
- by hand - boring!
- utilities - network or local (SMS; we are using a LAN server in the course). These can do other things as well, such as reverse complement, translation, restriction analysis etc. A variety of other utilities can be found here; feel free to explore.
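To show what a reverse-complement utility does internally, here is a small Python sketch. It is an illustration only, not the SMS implementation; the table covers the IUPAC nucleotide codes, and the names `COMPLEMENT` and `reverse_complement` are made up for this example.

```python
# Complement table for IUPAC nucleotide codes (upper and lower case).
# Ambiguity codes map to the code for the complementary base set,
# e.g. R (A or G) pairs with Y (T or C).
COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCN"))   # NGCAT
    print(reverse_complement("GAATTC"))  # GAATTC (the EcoRI site is palindromic)
```

Characters that are not in the table (spaces, digits) pass through unchanged, so the input should be cleaned first.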
Orientation in a sequence:
- translation
- restriction mapping
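Restriction mapping boils down to searching a sequence for known recognition sites. The following Python sketch assumes a tiny hand-picked enzyme table (real tools use much larger databases) and reports the 1-based position where each recognition site starts, not the exact cut position. The names `ENZYMES` and `find_sites` are hypothetical.

```python
import re

# A few common enzymes and their recognition sequences (5'->3').
ENZYMES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
    "XhoI": "CTCGAG",
    "SacI": "GAGCTC",
}


def find_sites(seq, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions]} for every enzyme that cuts."""
    seq = seq.upper()
    hits = {}
    for name, site in enzymes.items():
        # The lookahead (?=...) also reports overlapping occurrences.
        positions = [m.start() + 1 for m in re.finditer("(?=" + site + ")", seq)]
        if positions:
            hits[name] = positions
    return hits


if __name__ == "__main__":
    print(find_sites("ccGAATTCaaGGATCCtt"))  # {'EcoRI': [3], 'BamHI': [11]}
```

All sites in this table are palindromic, so scanning one strand is enough; non-palindromic sites would also need a search on the reverse complement.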
In silico cloning:
- PlasmaDNA, SerialCloner (local applications have to be installed)
- Nice online tools here (Molbiotools)
- SAVVY
Tasks
In every lesson, some (partial) tasks are described in color. These are "control points" required for getting course credits in the Faculty of Sciences course. At least four "colored" tasks, each of a different color, have to be presented.
1.1
You have just sequenced an Arabidopsis cDNA fragment cloned in the EcoRI site of the pBluescriptII SK- vector and obtained the following raw sequence:
CCGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAAT
TCGGTGACCCCGGCAAAGCTTGCTTAATCCGAAGACGTTTCTGTTTCATC
TTCTTAAATCCGGGCCAACNGCGTTTACGAGACTAAACGCGTTTCTCTTT
AGGGCTTAATTATTATCCAGAGATGGCTCATCATAGCAAATGTTTACAAA
CGTTGGATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAG
CTTTTGGAGGCAATACTTAAAGCTGGAAACAGAATGAACGCGGGTACCGC
GAGAGGAAACGCTCAAGCGTTTAATCTAACCGCGCTTTTGAAGCTTTCGG
ATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAG
GAAGTTGTTAGATCGGAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCA
TAGCTTAACACGAAGCGGTAGTAGTAACTACAATGGTGGTAATAGTAGTC
TTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTTGAAGCTTGGT
TTACCAGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGC
TGCTTGTGTGGACTATGAAACGGTTGTTGCAACTTGTTCTGCTCTTGCGG
TTAGAGCGAAAGATGCGAAAACGGTGATTGGAGAATGTGAAGATGGAGAA
GGAGGGAGGTTTGTGAAAACGATGATGACGTTTCNTGATTCGGTAGAGGA
AGAGGTGAAAATAGCGAAAGGTGAAGAGAGGAAAGTGATCCCTGA
- Copy the sequence into a text file and convert it to the FASTA format.
- Perform restriction analysis to identify sites from the vector polylinker (recommended program: SMS Restriction Digest). Check the vector-insert boundary using the map provided below, remove the vector sequences so that only clean cDNA is left, and save the clean cDNA sequence (in FASTA format) as a text file.
1.2
You have just sequenced the following Arabidopsis cDNA fragment. Note the format of the sequence: a header line (>mysequence) followed by the sequence itself, which means FASTA format.
>cDNA_2_At
CAAGCAACCTTTACACATATAGAAGAAGAAAAACACTTCTTTGTTTCTGTCATTAATTCCCTCCCTCTAT
ATATATATATTTAAATCTATTATGACAAAACAATCCAATACTGGATACTTTTTACAACAACATGCACAGA
CAAAGCTGAGATCTCACCTTAAGAAACTAATGAGATTGAGTTATGTTCGTCTTCATCAGACGAAGAGGTT
GAAGACGACGACGAAGAAGAAGAAGATTGTCTTCGTCCAACGAGTCCAGGAAGAGGTTGTGGCATCATTG
GATTCACTGGAACAGGAAACTTATGAGCAGAACTAACCATTGTTCTTTCGTTTATCATCCCTACTTCTTT
GCAAACTCTGTCTACTACTCCAAGGAAGTCTCTAACCACCAAGAATATTCTAAACGGATGCGCTTCTTCT
TTAGCCGAGTTTCCATGGAAATACTCTGTGATTTCTTTTACAAGTGATAACGCTACGCTCTCTTGAGCTT
GTACTCTGATGATCTCTTCCTCAGCTCTTTTCAGAAACNTTTTCATCGATTCCGNNAACCTCTGACTGTT
GCTTTCTTCTGTGATTGTTGATTGGACTTGGATTGCTTCGTTGATCTTGGCAATGNNTTGAGGAAAGCTT
GGAGANGTAGCTGCTTAGTACTTCTGAGTCCATCNNAGC
- Convert the sequence to antisense (reverse complement) and keep it as a FASTA file using SMS Reverse Complement.
- Look at the coding capacity of your cDNA in both directions (using SMS ORF Finder). Select the longest reading frame and keep its protein translation as a FASTA file. (NOTE: you will not necessarily get a reasonable ORF. Assume that this is a raw first-run sequence, which may contain frameshifts and other errors.)
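For orientation, the idea behind an ORF search can be sketched in Python as follows. This is a simplified illustration, not the SMS ORF Finder algorithm: it assumes the standard genetic code, defines an ORF as running from the first ATG to the next stop codon (or to the end of the sequence), and translates any codon containing an ambiguous base such as N as X. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement for plain A/C/G/T/N sequences."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate in frame 1; unknown or ambiguous codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(seq):
    """Return (strand, frame, protein) of the longest ATG-initiated ORF."""
    best = ("", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            # Stop codons ('*') split the translation into open segments.
            for segment in translate(s[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[2]):
                    best = (strand, frame + 1, segment[start:])
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))  # ('+', 3, 'MKF')
```

With a raw first-run sequence, a frameshift will split one real ORF across two frames, so it is worth looking at the runner-up frames as well as the single longest one.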
1.3
The following DNA sequence was assembled from two files, each in a different format.
1 TTTAATAAAA TAAAAATCCA CTCGCATTTT TATTTTCAAC ATTGTGCGTA 50
51 CGGTGCAATT CAATGAACAG TGTTTACTTT CAGTGTGTAC ACTTCTGCGG 100
101 ACTATTACAA AGTCCACGTC TTATCCTACG TGTTATAATC TCATATGTTA 150
151 CTGTCTGAAA TGGACCCCAC TACGTAAAAA TAAAATTAAG AATCAACCAC 200
201 TCTTCTTCCA TCACCTCTTT TGGCTTTCTC TCTACTCTCT CTACTACTCT 250
251 CTCACCATCA CTGAGTTAAG AGAACAAACC AAAAACAAAA TTATCAAACC 300
301 ATCACCAGCA GAATCTTAGC TGGATTCATC ACTCTATTCA AAAAGTTTCT 350
351 CTCTTCTCTT TTCTCAGATC TTGAACTCTT GAAGAAGAAA GAAGAAGATA 400
401 ACACAATGCT CTTCTTCTTA TTCTTCTTCT ACTTACTCTT ATCTTCATCC 450
451 TCCGATCTAG TCTTCGCCGA CCGTCGTGTA CTCCACGAAC CATTCTTCCC 500
501 TATAGATTCA CCACCACCGT CACCACCATC ACCACCACCA CTTCCTAAAC 550
551 TACCATTCTC TTCAACCACT CCTCCATCTT CATCAGACCC AAATGCTTCT 600
601 CCTTTCTTCC CTTTATACCC TTCATCTCCA CCACCACCTT CTCCAGCCTC 650
651 CTTCGCTTCT TTTCCGGCGA ATATCTCATC TCTAATCGTC CCTCACGCCA 700
701 CTAAATCCCC ACCTAACTCC AAAAAACTCC TTATCGTCGC TATCTCCGCC 750
751 GTTTCCTCCG CTGCTTTAGT CGCTCTACTT ATCGCTTTAC TCTATTGGCG 800
801 AAGAAGCAAA CGTAACCAAG ATCTTAACTT CTCCGATGAT AGCAAAACAT 850
851 ACACCACCGA CAGTAGCCGC CGTGTCTACC CTCCTCCTCC GGCAACGGCG 900
901 CCTCCAACAC GACGCAATGC GGAGGCTAGA AGTAAACAGA GGACCACCAC 950
951 GAGCTCCACC AATAACAACA GCTCTGAGTT TCTTTACTTA GGAACAATGG 1000
1001 TGAATCAAAG AGGAATCGAT GAACAATCTC TTAGTAATAA TGGATCAAGC 1050
1051 TCAAGAAAAC TTGAATCTCC AGATCTTCAA CCACTTCCTC CATTGATGAA 1100
1101 ACGAAGTTTC CGTTTAAATC CAGATGTTGG TTCAATCGGA GAAGAAGATG 1150
1151 AAGAAGATGA GTTTTACTCT CCACGTGGCT CACAAAGCGG GCGAGAACCG 1200
1201 TTAAACCGGG TCGGACTTCC GGGTCAAAAT CCTAGATCTG TTAACAATGA 1250
1251 CACTATCTCT TGCTCATCTT CAAGCTCTGG TTCACCAGGA AGATCAACAT 1300
1301 TTATCAGTAT CTCTCCTTCA ATGAGTCCTA AGAGATCTGA ACCAAAACCG 1350
1351 CCGGTTATCT CCACACCAGA ACCGGCGGAG TTAACCGATT ATAGATTTGT 1400
1401 TCGGTCTCCG TCACTGTCGT TAGCTTCTTT ATCGTCGGGA TTGAAAAACT 1450
1451 CCGATGAAGT AGGATTGAAT CAAATCTTTA GATCTCCGAC GGTTACATCT 1500
1501 CTAACAACTT CACCGGAGAA TAACAAAAAA GAGAACTCTC CATTATCATC 1550
1551 TACTTCAACT TCACCGGAAC GACGACCAAA TGATACACCA GAAGCTTACT 1600
TGAGATCTCCGTCGCATTCTTCTGCTTCTACATCACCGTATAGATGTTTT
CAGAAATCTCCGGAGGTCTTACCGGCGTTTATGAGTAATCTCCGGCAAGG
TTTGCAATCTCAGTTACTATCTTCTCCTTCTAACTCTCATGGAGGACAAG
GTTTCCTTAAGCAGTTAGATGCATTACGTTCTCGTTCACCGTCGTCGTCT
TCTTCTTCTGTTTGTTCTTCACCGGAGAAAGCTTCTCATAAGTCACCAGT
TACATCTCCTAAGTTATCTTCCCGGAATTCGCAGTCTCTATCATCTTCTC
CGGATAGAGATTTTAGTCATAGCTTAGATGTATCACCACGGATATCGAAC
ATTTCACCTCAAATTTTACAGTCTCGTGTGCCTCCGCCTCCTCCTCCTCC
CCCACCGTTGCCGTTGTGGGGACGACGGAGTCAGGTGACTACTAAAGCGG
ACACAATCTCGAGACCGCCTTCTCTTACACCGCCTTCACATCCTTTTGTG
ATCCCATCTGAAAACTTACCAGTGACTTCGTCTCCTATGGAGACTCCAGA
GACGGTTTGTGCGAGTGAGGCGGCGGAGGAAACTCCGAAACCGAAGCTAA
AGGCGTTACATTGGGATAAAGTTAGAGCAAGTTCGGATCGTGAGATGGTT
TGGGATCATCTTCGATCAAGCTCTTTCAAGTGAGTTAATGTGACATACTC
GTTTATATGATACTATATGCTTTTAGTGAGAATGTGGTTGTTGAGATTAT
GAATGTGGTTTGCAGATTAGATGAGGAGATGATTGAGACGTTGTTTGTGG
CGAAGTCGTTAAACAACAAACCAAATCAGAGTCAGACAACTCCAAGATGT
GTTCTCCCGAGCCCGAACCAAGAGAACAGAGTCCTGGACCCGAAGAAGGC
TCAGAATATTGCCATCTTGCTTCGTGCACTTAATGTCACTATAGAAGAAG
TTTGTGAGGCTCTTCTTGAAGGTAAACTATGCTGTCACATACATAGTTTC
TCATTTTCTTCTCCTTTGATCTCCAGAATTAGAGTTCTTATGCATTTGTT
AATGGTTTTTCGATGATATGGTTGAGTTATTCTGAAAGCTTTGCTTCTTT
GATGGTGTGGAGATTCTTGGTTACATTGATGTTCTTAGTTATGCTTTTTC
AGGCAATGCTGATACACTGGGGACTGAACTTCTTGAGAGCTTACTGAAGA
TGGCACCGACAAAAGAAGAAGAGCGCAAGTTGAAAGCGTACAATGATGAT
TCGCCTGTTAAGCTTGGACATGCTGAGAAATTCCTTAAGGCAATGTTGGA
CATCCCTTTCGCCTTTAAAAGAGTTGATGCAATGCTCTATGTAGCCAACT
TTGAGTCCGAGGTTGAATACTTGAAGAAATCTTTTGAGACTCTTGAGGTA
TATATTACAAGCTATTCTCTCTCTTTTTACCATATGGTTGTATTGTAACA
GATTATGACTTCATTTCTATTGTTTGTGTAGGCTGCTTGTGAAGAACTGA
GGAACAGTAGGATGTTCTTAAAGCTTCTTGAAGCGGTTCTAAAGACAGGA
AACCGTATGAACGTTGGAACAAACCGAGGAGATGCACATGCGTTCAAGCT
TGATACACTTCTCAAGCTAGTCGATGTCAAAGGCGCTGATGGGAAAACAA
CTCTCTTGCATTTCGTTGTACAAGAGATAATCCGAGCAGAAGGCACACGT
CTCTCAGGTAACAATACACAAACAGATGACATTAAATGCCGGAAACTAGG
TCTCCAAGTTGTATCAAGTCTCTGTTCTGAGCTTAGTAACGTCAAGAAAG
CTGCTGCGATGGACTCAGAAGTACTAAGCAGCTACGTCTCCAAGCTTTCT
CAAGGCATTGCCAAGATCAACGAAGCAATCCAAGTCCAATCAACAATCAC
AGAAGAAAGCAACAGTCAGAGGTTTTCGGAATCGATGAAAACGTTTCTGA
AAAGAGCTGAGGAAGAGATCATCAGAGTACAAGCTCAAGAGAGCGTAGCG
TTATCACTTGTAAAAGAAATCACAGAGTATTTCCATGGAAACTCGGCTAA
AGAAGAAGCGCATCCGTTTAGAATATTCTTGGTGGTTAGAGACTTCCTTG
GAGTAGTAGACAGAGTTTGCAAAGAAGTAGGGATGATAAACGAAAGAACA
ATGGTTAGTTCTGCTCATAAGTTTCCTGTTCCAGTGAATCCAATGATGCC
ACAACCTCTTCCTGGACTCGTTGGACGAAGACAATCTTCTTCTTCTTCGT
CGTCGTCTTCAACCTCTTCGTCTGATGAAGACGAACATAACTCAATCTCA
TTAGTTTCTTAAGGTGAGATCTCAGCTTTGTCTGTGCATGTTGTTGTAAA
AAGTATCCAGTATTGGATTGTTTTGTCATAATAGATTTAAATATATATAT
ATAGAGGGAGGGAATTAATGACAGAAACAAAGAAGTGTTTTTCTTTTCTG
CATTTGTGTAAAAAAAATAATATAGGTTTACCTTAAAATTTGTTCATCTT
AAATTAATAATTTAAGAATCAAATAAATTTGTTTATCTGAACCGTGTGTA
CCACGAAAGAATGTGAGAGCAAACATATTACTTACTTACCCTTCGTTGCT
GAATATAATGATCATTATAAATCACTACCTCCAGTACCTTCTACCTTCTT
CAAAGAACCTTGTTGGATTTGAACCAAAGTTGGAACATAATTGACGAGAG
GTGAGCATCTAGATTCTGCATCGTGATGATGATCCACTTTTATCTATTTA
- Convert the sequence to FASTA (recommended: SMS Filter DNA).
- Construct its restriction map using SMS Restriction Map, WebCutter, or Molbiotools Restriction Analyzer.
- A "virtual cloning" exercise: take the EcoRI-XhoI fragment and clone it into pBluescript SK- (its sequence and map can be found here in the PlasmaDNA format). Create a map of the product using PlasmaDNA. Save any FASTA intermediates as *.fasta.
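The first and last steps of this task can be approximated in a few lines of Python. This is a rough sketch, not what SMS Filter DNA or PlasmaDNA actually do. It assumes the input contains only sequence, position numbers and whitespace, uses the first EcoRI site and the first XhoI site downstream of it, and returns the top strand between the two cut points (G^AATTC and C^TCGAG), ignoring the bottom-strand overhangs. The names `filter_dna` and `fragment_between` are invented for the example.

```python
IUPAC_DNA = set("ACGTURYSWKMBDHVN")


def filter_dna(text):
    """Drop everything that is not an IUPAC nucleotide letter (digits, spaces, line breaks)."""
    return "".join(ch for ch in text.upper() if ch in IUPAC_DNA)


def fragment_between(seq, site5="GAATTC", cut5=1, site3="CTCGAG", cut3=1):
    """Top-strand fragment from the first `site5` cut to the next `site3` cut.

    cut5/cut3 are the cut offsets inside each recognition site
    (1 for EcoRI G^AATTC and for XhoI C^TCGAG).
    """
    start = seq.find(site5)
    if start == -1:
        raise ValueError("5' site not found")
    end = seq.find(site3, start + len(site5))
    if end == -1:
        raise ValueError("3' site not found downstream of the 5' site")
    return seq[start + cut5:end + cut3]


if __name__ == "__main__":
    numbered = "1 ttGAATTCaa acgtCTCGAG 20\n21 gg"
    clean = filter_dna(numbered)
    print(clean)                    # TTGAATTCAAACGTCTCGAGGG
    print(fragment_between(clean))  # AATTCAAACGTC
```

Before relying on such a shortcut, check the restriction map to confirm that each enzyme cuts the sequence only once; otherwise the "first site" assumption picks the wrong fragment.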
1.4
Using SAVVY, construct a map of a plasmid, pTEST, of the following properties:
- total size 4.7 kb
- insert "MyGene" with its 5′ end defined by an EcoRI site at position 1 and its 3′ end by a SacI site at position 1950
- bacterial origin of replication ("Ori") between positions 2800 and 3100
- kanamycin resistance "KanR" between position 3400 and 4500 in the reverse orientation.
Hints for saving the output: