Topic 2: Public data resources

Sequence databases, searching, downloading. Genome sites and other "added value" resources. Tools for data access and manipulation.

Where can I obtain sequence data?

Public comprehensive databases: ENA (EMBL), GenBank, DDBJ (DNA + protein)
Public specialized/selective databases - examples:

by molecule type + additional available data (UniProt)
by organism e.g. Plant GenIE, ThaleMine
by funding agency/institution that has sequenced the genomes (DOE Joint Genome Institute)

... and, of course, experimentally; in that case you should deposit your sequences in the public databases.

Selected resources will be demonstrated and used in the practical tasks.

The NCBI portal

- more than just sequences:

Literature - PubMed/MEDLINE + some online journals + books
GenBank
3D structures
Genomes
Links to similarity searches (see Lesson 3)

Searching GenBank:

database subsets - DNA (incl. EST, STS, GSS ...), protein - Limits function
unique accession numbers
anatomy of a GenBank record, saving options ...
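
Besides the web interface, records can be retrieved by accession number through the NCBI E-utilities. The following is a minimal sketch, not an official client: the helper names `efetch_url` and `fetch_record` are made up for this example, and "ACCESSION" is a placeholder to be replaced by a real accession number.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an efetch URL for one accession (no network access needed)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession, db="nuccore", rettype="gb"):
    """Download the record as text; requires an internet connection."""
    with urlopen(efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


# "ACCESSION" is a placeholder, not a real accession number.
print(efetch_url("ACCESSION", db="protein", rettype="fasta"))
```

With `rettype="gb"` the record comes back in the GenBank flat-file format, with `rettype="fasta"` as FASTA; `db` selects the nucleotide (`nuccore`) or `protein` subset.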

EMBL/EBI database portal

a smart but complicated way to access EMBL (with good help pages)

Genome sites - the Arabidopsis example:

TAIR ... (most content now requires a paid subscription, but have a look e.g. at the Insertion, Knockout and Mutation resources!)
SIGnAL
AGD
ThaleMine

A frequently used general format for displaying and organizing genome information: JBrowse

Transcriptome sites (demo):

EBI ArrayExpress
Gene Expression Omnibus (GEO)
ePlant and eFP Browser ("electronic fluorescent pictograph browser") for multiple plant species (part of the BAR tools)
Genevestigator (only very basic functions are free, multiple organisms)

Ontology:

Controlled dictionary to designate biological processes, structures, conditions, etc.

BioMart: a tool for bulk download of genetic data, with the option to apply filters (general and plant versions, help and tutorials).
The PANTHER classification system

Last but not least: finding literature:

NCBI - PubMed
Google Scholar
Web of Science (paid - University has access)
What about artificial intelligence? (Try Perplexity!)

Tasks

2.1

Obtain and inspect the sequence of the pGWB4 cloning vector. Examine the various sequence formats available and the annotations. Using the knowledge from the previous lesson, construct a map of pGWB4.

2.2

Search the protein section of the GenBank/EMBL database for potato (Solanum tuberosum) extensins.

Save at least five of the sequences you have found as a "multiple" FASTA file and keep them for future use. What is peculiar about the extensins?
Select one of the sequences and explore the "Related information" section (to the right from sequence description). Find the corresponding nucleotide sequence and examine the sequence file.
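
Once the "multiple" FASTA file is saved, it can be inspected with a few lines of code. The sketch below assumes plain FASTA text; `read_fasta` and `composition` are names chosen for this example, and the two toy records are made up (they are not extensins). Running `composition` on your own sequences may help with the question above.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def composition(sequence, top=3):
    """Return the `top` most common residues with their fractions."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return [(res, round(n / total, 2)) for res, n in counts.most_common(top)]


# Made-up toy records for illustration only.
toy = """>toy1 made-up example
MKAAAKL
LAA
>toy2 made-up example
GGSGG
"""

for header, seq in read_fasta(toy):
    print(header.split()[0], len(seq), composition(seq))
```

To use it on a saved file, read the file contents first, e.g. `read_fasta(open("extensins.fasta").read())`, where the file name is whatever you chose when saving.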

2.3

Search the protein section of the GenBank/EMBL database for Rab GDP dissociation inhibitors of Arabidopsis thaliana.

Download all the "RefSeq" sequences you have found, save them as a "multiple" FASTA file, discard sequences that are not GDI, and keep the rest for future use. How many different RabGDIs did you find?
Explore the annotation of these sequences, paying attention to the locus identifiers (either in the "At1g59810" - i.e. "genomic" - or the "F21G12.120" - "BAC" - format). Use the genomic identifiers to annotate the sequences in the FASTA file produced in the previous step. Keep the file for future use.
Demo: Visit the SIGnAL database and search for available information on the loci you have found, using the genomic identifiers for a "Gene search". Examine the expression of at least one of the found loci using BAR tools.
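
The annotation step of this task (adding genomic locus identifiers to the FASTA headers) can be done by hand in a text editor or scripted. A possible sketch follows; the function name `annotate_fasta`, the accessions "ACC_0001"/"ACC_0002" and the locus "At1g00000" are placeholders invented for this example, and the accession-to-locus mapping has to be filled in from the annotations you actually find.

```python
import re

# Genomic ("AGI") locus identifiers look like At1g59810:
# chromosome 1-5, C (chloroplast) or M (mitochondrion), then "g" and five digits.
AGI = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def annotate_fasta(text, locus_by_accession):
    """Prefix each FASTA header with its locus identifier, where known."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            accession = line[1:].split()[0]
            locus = locus_by_accession.get(accession)
            if locus is not None:
                if not AGI.match(locus):
                    raise ValueError(f"not a genomic locus identifier: {locus}")
                # The locus goes first so that it becomes the sequence name.
                line = f">{locus} {line[1:]}"
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder records and mapping, for illustration only.
source = ">ACC_0001 toy protein\nMKV\n>ACC_0002 other toy protein\nGG\n"
print(annotate_fasta(source, {"ACC_0001": "At1g00000"}), end="")
```

Putting the locus identifier first is a design choice: many programs treat the first word of the header as the sequence name, so the identifier stays visible in later analyses. Headers without a known locus are left unchanged.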

2.4

Use the plant BioMart tool to find all genes that are actin binding (using a gene ontology term) in the Arabidopsis thaliana (TAIR10) database. In the final gene list, find the number of genes and identify at least 3 gene families whose members bind actin in Arabidopsis thaliana.
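
If the BioMart results are exported as a tab-separated file, counting genes and grouping them can be scripted. The sketch below is only one possible approach: it assumes columns named "Gene stable ID" and "Gene name" (check the header of your own export), and it guesses families crudely by stripping trailing digits from gene names. The function name `summarize` and all rows are made up for illustration.

```python
import csv
import io
import re
from collections import defaultdict


def summarize(tsv_text, id_col="Gene stable ID", name_col="Gene name"):
    """Count unique genes and group them by gene-name prefix."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    names_by_gene = {}
    for row in reader:
        # Exports may repeat a gene on several rows; a dict keeps one entry each.
        names_by_gene[row[id_col]] = row.get(name_col) or ""
    families = defaultdict(set)
    for gene_id, name in names_by_gene.items():
        # Crude family guess: drop trailing digits, dots and hyphens from the name.
        prefix = re.sub(r"[\d.\-]+$", "", name.upper())
        if prefix:
            families[prefix].add(gene_id)
    return len(names_by_gene), {k: sorted(v) for k, v in families.items()}


# Made-up rows; real exports contain locus identifiers and real gene names.
toy = (
    "Gene stable ID\tGene name\n"
    "GENE_A\tFAMX1\n"
    "GENE_B\tFAMX2\n"
    "GENE_B\tFAMX2\n"
    "GENE_C\tFAMY1\n"
    "GENE_D\t\n"
)
n_genes, families = summarize(toy)
print(n_genes, families)
```

Name prefixes are only a hint: genes of one family may carry unrelated names, so the groups should be checked against the gene descriptions.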

Online tools