The Reference Genome Annotation Project

As more and more genomes are sequenced, we are in the middle of an explosion of genomic information. Because resources for manual annotation are limited, automatic annotation will be the method of choice for many groups working on newly sequenced genomes. The GO Consortium coordinated an effort to maximize and optimize the GO annotation of a large and representative set of key genomes, known as 'reference genomes'. The goal of this project was to completely annotate twelve reference genomes so that those annotations could be used to seed the automatic annotation of other genomes.

Reference Species and Databases

The reference genomes and responsible database groups are:

The Reference Genome GO Annotation Team, with trained and highly skilled GO curators from each genome annotation group, coordinated annotation, facilitated implementation of GO Consortium annotation priorities, and provided quantitative measures to assess progress toward the goal of broad and deep annotation of the reference genomes. This group represents the annotation expertise within the GO Consortium and provides key liaisons to the model organism databases that have primary responsibilities for the annotation of the reference genomes.

Sequences used for building the trees can be obtained from the EBI's Reference Proteomes website. To access the complete proteomes for other species, please visit the UniProt Complete Proteomes collection.

Reference Genome Project publications

  • Reference Genome Group of the Gene Ontology Consortium. The Gene Ontology's Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput. Biol. Jul 2009;5(7):e1000431. PMID:19578431; doi:10.1371/journal.pcbi.1000431

Overview of project goal and strategy

The goals of the Reference Genome Project are:
  • provide a set of comprehensive experimental GO annotations for all gene products in all twelve Reference Genomes
  • provide tools for using these annotations to infer GO annotations for all fully sequenced genomes
Evolutionary relationships are the "glue" in the Reference Genome Project. Related genes across all twelve Reference Genomes are curated simultaneously, providing:
  • Better annotations: each model organism has unique strengths for probing gene function, and bringing this information together helps to interpret experimental results, which improves the accuracy and consistency of annotations.
  • More annotations: homology relationships allow accurate inference of functions for genes that have not been characterized experimentally.
  • Improvements in the Gene Ontology: cross-organism discussion about annotations frequently leads to new terms being added to the Gene Ontology.

Curation process

  1. Identify the initial set of target genes in (typically) one species. Genes are selected if they belong to one of the following four categories:
    • Orthologs of human disease genes
    • Topical or ‘hot’ genes
    • Genes conserved from E. coli to human but currently lacking GO annotation
    • Genes involved in biochemical and/or signaling pathways
  2. Identify the ortholog(s)/homolog(s) of the selected target genes in all Reference Genome species, using phylogenetic trees in the PANTHER database. Some species may have no orthologs or homologs of a selected gene.
  3. Curators from each model organism database collect available literature about the genes in their respective organism.
  4. Curators assign GO terms based on experimental data.
  5. Review existing GO annotations to make sure they conform to agreed GO annotation standards (see below).
  6. Overlay all annotations on the phylogenetic tree of the gene family. Annotations are reviewed for consistency, and modified if necessary. Ancestral nodes in the tree are annotated, allowing reliable, traceable inference of annotations based on homology. These processes are carried out using the PAINT (Phylogenetic Annotation INference Tool) software operating on the trees in the PANTHER database.
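Step 6 can be illustrated with a small sketch. It is not PAINT's implementation: the `Node` class, the `infer` function, and the gene and ancestor names are hypothetical, and the rule used (an ancestral annotation is propagated only when at least one descendant carries experimental evidence for the term) is a simplifying assumption. The point is only to show how an annotation on an ancestral node yields inferred annotations that remain traceable to the experimental data.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """A node in a gene-family tree (hypothetical, simplified structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # GO terms supported by experimental evidence (set on extant genes).
    experimental: Set[str] = field(default_factory=set)
    # GO terms a curator has assigned to this ancestral node.
    ancestral: Set[str] = field(default_factory=set)


def leaves(node):
    """Yield every extant gene (leaf) at or below the given node."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def infer(root):
    """Map (gene, term) to (annotated ancestor, supporting genes)."""
    inferred = {}

    def visit(node):
        for term in node.ancestral:
            support = sorted(
                leaf.name for leaf in leaves(node) if term in leaf.experimental
            )
            if not support:
                continue  # assumption: no experimental support, no propagation
            for leaf in leaves(node):
                if term not in leaf.experimental:
                    # Keep the first (most ancient) annotated ancestor found.
                    inferred.setdefault((leaf.name, term), (node.name, support))
        for child in node.children:
            visit(child)

    visit(root)
    return inferred


# Invented example family: two genes have experimental evidence for
# GO:0003677 (DNA binding); the third has none.
human = Node("gene_human", experimental={"GO:0003677"})
mouse = Node("gene_mouse")
yeast = Node("gene_yeast", experimental={"GO:0003677"})
mammals = Node("ancestor_mammals", children=[human, mouse])
root = Node("ancestor_root", children=[mammals, yeast], ancestral={"GO:0003677"})

result = infer(root)
for (gene, term), (ancestor, support) in result.items():
    print(gene, term, "inferred via", ancestor, "supported by", support)
```

Each inferred annotation records both the ancestral node it came from and the genes whose experimental annotations support it, which is one way to keep the evidence trail traceable.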

How does this project differ from standard GO annotation?

The main results of this process are:
  • additional quality assurance of experimental GO annotations by viewing each annotation in the context of annotations for related genes
  • a set of high-quality inferred GO annotations derived from the annotated phylogenetic trees
  • a fully traceable evidence trail for all annotations, both experimental and inferred
The reference genome databases have agreed to follow more stringent guidelines than those used for standard GO annotation:
  • Experimental evidence codes (IDA, IPI, IMP, IGI, IEP) should be used where possible. The ultimate objective is to provide experimentally based annotations for all gene products from these organisms.
  • Terms inferred from sequence or structural similarity (ISS) should only be used where the terms are supported by experimental evidence for the similar sequence.
  • Non-traceable author statements (NAS) should not be used.
  • No new annotations should be based on traceable author statement (TAS); existing terms assigned with TAS should gradually be replaced with the appropriate experimental evidence code based on the primary literature.
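The guidelines above can be read as a simple decision rule over evidence codes. The following is a minimal sketch under that reading; the function name `check_evidence`, its parameters, and the result labels ("ok", "replace", "reject", "review") are hypothetical and are not part of any GO Consortium tool. Codes that the guidelines do not mention are flagged for review rather than judged.

```python
# Experimental evidence codes named in the guidelines above.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def check_evidence(code, is_new=True, similar_has_experimental=False):
    """Classify one annotation's evidence code (hypothetical labels).

    is_new: whether the annotation is being newly created.
    similar_has_experimental: for ISS, whether the term is supported by
        experimental evidence for the similar sequence.
    """
    if code in EXPERIMENTAL:
        return "ok"
    if code == "ISS":
        return "ok" if similar_has_experimental else "reject"
    if code == "NAS":
        return "reject"
    if code == "TAS":
        # No new TAS annotations; existing ones should gradually be replaced.
        return "reject" if is_new else "replace"
    return "review"  # code not covered by the guidelines listed above


for code, kwargs in [
    ("IDA", {}),
    ("ISS", {"similar_has_experimental": True}),
    ("TAS", {"is_new": False}),
    ("NAS", {}),
]:
    print(code, check_evidence(code, **kwargs))
```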

Where can GO annotations from the project be viewed?

All GO annotations from this project are included in the gene association files that each group submits to GO. Annotations can also be viewed using the GO search engine and browser AmiGO. Annotated families can be viewed with the homolog set browser.
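As an illustration of working with a gene association file, the sketch below tallies evidence codes. It assumes the tab-delimited GAF layout, in which lines beginning with "!" are comments and the evidence code is the seventh column; the function name `evidence_counts` is hypothetical, and the sample records are invented placeholders rather than real annotations.

```python
import io
from collections import Counter


def evidence_counts(handle):
    """Count evidence codes in a GAF-style, tab-delimited annotation file."""
    counts = Counter()
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed record: no evidence-code column
        counts[fields[6]] += 1  # column 7 holds the evidence code
    return counts


def record(gene_id, symbol, go_id, evidence):
    """Build the first nine columns of an invented record for the demo."""
    return "\t".join(
        ["EXAMPLE", gene_id, symbol, "", go_id, "EXAMPLE_REF:1", evidence, "", "F"]
    )


sample = "\n".join(
    [
        "!gaf-version: 2.2",
        record("X0001", "geneA", "GO:0003677", "IDA"),
        record("X0002", "geneB", "GO:0003677", "ISS"),
        record("X0003", "geneC", "GO:0003677", "IDA"),
    ]
) + "\n"

counts = evidence_counts(io.StringIO(sample))
print(dict(counts))
```

To run this on a real file, pass an open file handle instead of the `io.StringIO` sample.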

Concluding Remarks

This project aims to improve annotations across a wide range of organisms. The resulting high-quality annotations should improve the electronic annotations that are propagated from this resource and will facilitate cross-species functional comparison. Furthermore, easier comparison of annotations between organisms may lead to new hypotheses and inspire new research.