Topic 2: Public data resources
Sequence databases, searching, downloading. Genome sites and other "added value" resources. Tools for data access and manipulation.
Where can I obtain sequence data?
- Public comprehensive databases: ENA (EMBL), GenBank, DDBJ (DNA + protein)
- Public specialized/selective databases - examples:
- by molecule type + additional available data (UniProt)
- by organism e.g. Plant GenIE, ThaleMine
- by funding agency/institution that has sequenced the genomes (DOE Joint Genome Institute)
- ... and, of course, experimentally, in which case you should deposit your sequences in the public databases.
The NCBI portal
- more than just sequences:
- Literature - PubMed/MEDLINE + some online journals + books
- GenBank
- 3D structures
- Genomes
- Links to similarity searches (see Lesson 3)
Searching GenBank:
- database subsets - DNA (incl. EST, STS, GSS ...), protein - Limits function
- unique accession numbers
- anatomy of a GenBank record, saving options ...
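Because every record has a unique accession number, it can also be retrieved without the web interface. A minimal sketch, assuming NCBI's E-utilities efetch endpoint with its db, id, rettype and retmode parameters; the helper name efetch_url is ours and the accession in the usage line is a made-up placeholder. The function only builds the URL and does not download anything.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL for one or more accession numbers.

    db:      "nuccore" for nucleotide records, "protein" for protein records
    rettype: "fasta" for FASTA; "gb" (nucleotide) or "gp" (protein)
             for the full flat-file record with annotations
    """
    if isinstance(accessions, str):
        accessions = [accessions]
    params = {
        "db": db,
        "id": ",".join(accessions),  # several accessions are comma-separated
        "rettype": rettype,
        "retmode": "text",
    }
    # safe="," keeps the commas between accessions readable in the URL
    return EFETCH + "?" + urlencode(params, safe=",")


# Placeholder accession (hypothetical, not a real record)
print(efetch_url("XX000000.1"))
```

Pasting the printed URL into a browser (with a real accession) should return the record as plain text, which can then be saved in the same way as from the "Send to" menu.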
EMBL/EBI database portal
- a smart but complicated way to access EMBL (with good help pages)
Genome sites - the Arabidopsis example:
- TAIR ... (most content now requires a paid subscription, but have a look e.g. at the Insertion, Knockout and Mutation resources!)
- SIGnAL
- AtGDB
- ThaleMine
A frequently used general format for displaying and organizing genome information: JBrowse
Transcriptome sites (demo):
- EBI ArrayExpress
- Gene Expression Omnibus (GEO)
- ePlant and eFP Browser ("electronic fluorescent pictograph browser"), available for multiple plant species, part of the BAR Tools
- Genevestigator (only very basic functions are free; multiple organisms)
Ontology:
A controlled vocabulary for designating biological processes, structures, conditions, etc.
- BioMart: a tool for bulk download of gene data, with the option to apply filters (general and plant versions, help and tutorials)
- The PANTHER classification system
Last but not least: finding literature:
- NCBI - PubMed
- Google Scholar
- Web of Science (paid - University has access)
- What about artificial intelligence? Try Perplexity!
Tasks
2.1
Obtain and inspect the sequence of the pGWB4 cloning vector. Examine the various sequence formats available and the annotations. Using the knowledge from the previous lesson, construct a map of pGWB4.
2.2
Search the protein section of the GenBank/EMBL database for poplar (Populus sp.) extensins.
- Save at least five of the sequences you have found as a "multiple" FASTA file and keep them for future use. What is peculiar about the extensins?
- Select one of the sequences and explore the "Related information" section (to the right of the sequence description). Find the corresponding nucleotide sequence and examine the sequence file.
2.3
Search the protein section of the GenBank/EMBL database for Rab GDP dissociation inhibitors of Arabidopsis thaliana.
- Download all the "RefSeq" sequences you have found, save them as a "multiple" FASTA file, discard sequences that are not GDIs, and keep the rest for future use. How many different RabGDIs did you find?
- Explore the annotation of these sequences, paying attention to the locus identifiers (either in the "At1g59810" - i.e. "genomic" - or the "F21G12.120" - "BAC" - format). Use the genomic identifiers to annotate the sequences in the FASTA file produced in the previous step. Keep the file for future use.
- Demo: Visit the SIGnAL database and search for available information on the loci you have found, using the genomic identifiers for a "Gene search". Examine the expression of at least one of the found loci using BAR tools.
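Annotating the headers of a "multiple" FASTA file by hand works for a handful of sequences; for larger sets a short script may be more convenient. A minimal sketch in plain Python (no Biopython), assuming the first word of each header is the accession; the function names, the accessions EXAMPLE_1.1 / EXAMPLE_2.1 and the sequences are made-up toy data, and pairing the toy accession with At1g59810 is purely illustrative.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def annotate(records, locus_by_accession):
    """Prefix each header with its genomic locus identifier, if known."""
    annotated = []
    for header, seq in records:
        accession = header.split()[0]  # assumption: accession comes first
        locus = locus_by_accession.get(accession)
        new_header = f"{locus} {header}" if locus else header
        annotated.append((new_header, seq))
    return annotated


def format_fasta(records, width=60):
    """Write records back as multi-FASTA text, wrapping the sequences."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy input: hypothetical accessions and sequences, not real records
toy = ">EXAMPLE_1.1 toy protein\nMDEE\nYDVI\n>EXAMPLE_2.1 another\nMKKL\n"
loci = {"EXAMPLE_1.1": "At1g59810"}  # illustrative mapping only
print(format_fasta(annotate(parse_fasta(toy), loci)))
```

Headers without an entry in the mapping are left unchanged, so sequences that still need a locus identifier are easy to spot in the output.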
2.4
Use the plant BioMart tool to find all genes that are actin binding (using the gene ontology term) in the Arabidopsis thaliana (TAIR10) database. In the final gene list, find the number of genes and identify at least 3 gene families whose members bind actin in Arabidopsis thaliana.
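If the BioMart result is exported as a tab-separated file, the same gene may occupy several rows (for example one per transcript), so the number of genes is the number of unique identifiers rather than the number of rows. A minimal sketch, assuming the export has a header line with a column called "Gene stable ID"; that column name, the function name and the toy rows below are assumptions, so check them against your own export.

```python
import csv
import io


def unique_gene_ids(tsv_text, column="Gene stable ID"):
    """Return the sorted unique gene identifiers found in one column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # A set removes repeated rows belonging to the same gene
    return sorted({row[column] for row in reader if row.get(column)})


# Toy export with made-up identifiers (GENE_A appears twice)
toy_export = (
    "Gene stable ID\tGene description\n"
    "GENE_A\tfirst toy gene\n"
    "GENE_A\tfirst toy gene\n"
    "GENE_B\tsecond toy gene\n"
)
genes = unique_gene_ids(toy_export)
print(len(genes), genes)
```

For a real file, read its contents with open(...).read() and pass the text to the function; the gene descriptions in the second column are a reasonable starting point for spotting gene families.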