Publicly available databases

Many databases exist for sequence data and for its downstream analysis. Below we list some of the most common ones and their functions. The list is not exhaustive, but it can help you get started finding data relevant to your analysis.

Finding data of interest

  • PubMed: citations, abstracts and links to more than 27 million scientific papers.
  • Google Scholar
  • Dryad: a “curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable.” Integrated with many journals.
  • FigShare: a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner.

NCBI

NCBI has a lot of really wonderful resources. These all have different interfaces, and some are better organized than others, but the data housed within the various databases is gold.

  • GEO (Gene Expression Omnibus): gene expression data, both array- and sequence-based. Experimental design is also reported, although some experiments give more detail than others.
  • Assembly: organisms with genomic assemblies.
  • Taxonomy: names of organisms, taxonomic ID. It can be a bit of a mess, but also really useful when you go to do anything with phylogeny.
  • Sequence Read Archive (SRA): raw data files. Can be filtered by DNA, RNA, whole genome sequencing, organism, etc. See below for accessing and downloading these data.
  • WGS: Whole Genome Shotgun projects (complete or incomplete assemblies)
  • Many others!
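
The kind of filtering described for the Sequence Read Archive can also be expressed programmatically. The sketch below is only an illustration, not part of the original list: it assumes NCBI's E-utilities `esearch` endpoint and the Entrez field tags `[Organism]` and `[Strategy]`, and the function name `build_sra_search_url` is hypothetical. It only builds the query URL; fetching it (for example with `urllib.request.urlopen`) is left to the reader.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(organism, strategy=None, retmax=20):
    """Return an esearch URL for SRA records matching the given filters.

    `organism` and `strategy` are wrapped in Entrez field tags; the tag
    names used here are assumptions about the SRA search fields.
    """
    terms = [f'"{organism}"[Organism]']
    if strategy:
        terms.append(f'"{strategy}"[Strategy]')
    params = {
        "db": "sra",                   # search the Sequence Read Archive
        "term": " AND ".join(terms),   # combine the filters
        "retmax": retmax,              # maximum number of record IDs returned
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example organism and strategy chosen purely for illustration.
    print(build_sra_search_url("Ciona intestinalis", strategy="rna seq"))
```

Building the URL separately from fetching it makes the query easy to inspect in a browser before scripting anything around it.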

Downloading data from NCBI

  • European Nucleotide Archive (ENA): links to fastq files. You can search for SRA project data here to download fastq files and avoid the SRA format (below).
  • SRA Toolkit: command-line interface – recommended only if you have many samples to download
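
Both download routes above can be scripted. The following is a minimal sketch under stated assumptions, not a definitive recipe: it assumes the ENA Portal API `filereport` endpoint with the `run_accession` and `fastq_ftp` fields (where `fastq_ftp` holds semicolon-separated paths without a scheme), and the SRA Toolkit commands `prefetch` and `fasterq-dump` with the `--split-files` and `--outdir` options. The function names and the accession `SRR000001` are placeholders for illustration.

```python
import csv
import io
import shlex
from urllib.parse import urlencode

# Assumption: the ENA Portal API file report endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession):
    """Build a file report URL listing fastq files for a run or project."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def parse_fastq_links(tsv_text):
    """Map each run accession in a TSV file report to its fastq URLs."""
    links = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Paired-end runs list several files separated by ';'.
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        links[row["run_accession"]] = ["ftp://" + p for p in paths]
    return links


def sra_toolkit_commands(run_accession, outdir="fastq"):
    """Return the SRA Toolkit commands (as argument lists) for one run."""
    return [
        ["prefetch", run_accession],
        ["fasterq-dump", run_accession, "--split-files", "--outdir", outdir],
    ]


if __name__ == "__main__":
    print(ena_filereport_url("SRR000001"))
    for cmd in sra_toolkit_commands("SRR000001"):
        # Printed rather than executed; pass each list to subprocess.run to run it.
        print(shlex.join(cmd))
```

For a handful of samples, opening the ENA URL in a browser and downloading the listed files is usually enough; the toolkit commands pay off when looping over many run accessions.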

Other Protein data

  • UniProt
    • The UniProt Knowledgebase (UniProtKB) is composed of two sections: Swiss-Prot and TrEMBL. Swiss-Prot is a database of manually curated protein sequences (very high quality!), while TrEMBL is automatically annotated (but contains many more sequences).
  • NCBI protein
    • A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

Genomes & Genome Browsers

  • Ensembl: a genome browser that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species. Also contains reference genomes and annotation files that can be downloaded.
    • EnsemblMetazoa: the same as Ensembl, but with additional organisms that are not available on the primary Ensembl page.
  • UCSC Genome Browser: contains many reference genomes and many tools to search them, including BLAT, in silico PCR, LiftOver (converts genome coordinates from one assembly to another) and many other useful tools.
  • RefSeq: a non-redundant and well-annotated set of reference sequences including genomic, transcript, and protein.
  • GENCODE: high quality reference gene annotation and experimental validation for human and mouse genomes.
  • Joint Genome Institute

Other databases full of many things

  • GenBank: NIH genetic sequence database, an annotated collection of all publicly available DNA sequences
  • EMBL-EBI: European Bioinformatics Institute.
  • DDBJ: DNA Data Bank of Japan

BLAST

  • NCBI
  • UniProt
  • Ensembl: organism-specific BLAST supported.

Metagenomes

  • MG-RAST: full of great data, but there are some reports of odd uploading and downloading behavior, so be careful when using this resource!
  • EBI metagenomics
  • UniMES: UniProt Metagenomic and Environmental Sequences

Marine organism resources

  • Aniseed: Ascidian Network for in situ Expression and Embryological Data
  • Echinobase: Several echinoderm genomes, expression data also available
  • OIST Marine Genomics Unit: several genomes and omics data for various marine organisms

Tool Aggregators