This page documents the requirements for supplying GO annotations to the GO Consortium (GOC). For more general information on GO annotation, please see the GO annotation guide.
- What are the minimum requirements to submit GO annotations?
- Complete Identifier Mapping Files
Submission of an annotation file in GAF 2.0 format.
All GO annotation groups who would like to supply their annotations to the Consortium must supply an appropriately formatted annotation file that conforms to the Consortium's syntactic and semantic requirements. The primary GO annotation format is GAF 2.0. If you are a new annotation group and need assistance, please visit the Annotation FAQ page or contact the GO Consortium.
Annotations should be made to UniProtKB protein accessions or NCBI gene product identifiers
Ideally, all annotations should describe the activities or locations of UniProtKB accessions present in the UniProt Reference Proteome Files. However, if this is not possible, groups should provide identifier mapping files (gp2protein and gp2rna files) that supply the equivalent UniProtKB or NCBI identifiers. A gp_unlocalized file should additionally be provided where no sequence or genomic location is known for a gene identifier.
Willingness to adopt Evidence Code identifiers
While the current primary annotation file format applies GO evidence codes to describe the category of support available in the cited reference, groups must support the Consortium's intent to transition to using Evidence Code Ontology (ECO) identifiers in future annotation formats.
Annotation update responsibility lies primarily with the submitter, but can revert to the GO Consortium
Curation groups do not need to commit to supplying regular updates to their annotations. For non-recurring submissions, or for submissions from annotation groups that are no longer active annotation providers, responsibility for corrections and updates will revert to the GOC.
Why do we need identifier (ID) mapping files?
- For downloading sequences from UniProtKB/NCBI. These sequences are used for AmiGO BLAST and for phylogenetic inferencing (PAINT)
- To search for GO annotations in AmiGO using other database cross reference IDs (UniProt or NCBI)
- The ID mapping helps with bookkeeping and tracking of IDs and annotations, for example when removing duplicates.
- In all cases where identifier mapping is carried out, groups must be aware that due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.
- The file must meet the gp2protein format specification
- The gp2protein mapping file must contain the full list of all protein-encoding genes in the respective organism (or community), including those proteins not annotated to GO.
- The first column contains all gene or gene product identifiers (these are typically MOD-specific identifiers) and the second column contains mappings to canonical identifiers. Protein coding genes must map to UniProtKB accessions (Swiss-Prot in preference, if not then TrEMBL). If identifiers are truly unavailable in UniProtKB then NCBI identifiers (NP_ and XP_) are permissible.
- If an annotation group is fully satisfied with the identifier mapping from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB is willing to take on the responsibility of supplying the external id -> UniProtKB mapping to the GO.
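The two-column rule for gp2protein described above can be checked mechanically before a file is submitted. The following is a minimal sketch, not the Consortium's own validator. It assumes that the file is tab-delimited, that lines starting with `!` are comments, that multiple targets in column 2 are separated by `;`, and that UniProtKB targets carry a `UniProtKB:` prefix. None of these details are stated on this page, so confirm them against the gp2protein format specification. The function name `check_gp2protein_line` is hypothetical.

```python
def check_gp2protein_line(line):
    """Return a list of problems found in one gp2protein row (empty if none).

    Assumptions (not taken from this page): tab-delimited columns,
    '!' comment lines, ';' between multiple targets, 'UniProtKB:' prefix.
    """
    if line.startswith("!") or not line.strip():
        return []  # assumed comment or blank line: nothing to check

    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return [f"expected 2 tab-separated columns, found {len(fields)}"]

    source_id, targets = fields
    problems = []
    if not source_id.strip():
        problems.append("column 1 (gene or gene product identifier) is empty")

    target_list = [t.strip() for t in targets.split(";") if t.strip()]
    if not target_list:
        # Assumption: every protein-coding gene needs at least one mapping.
        problems.append("column 2 has no canonical identifier")

    for target in target_list:
        accession = target.rpartition(":")[2]
        if target.startswith("UniProtKB:") and accession:
            continue  # preferred case: a UniProtKB accession
        if accession.startswith(("NP_", "XP_")):
            continue  # permitted fallback: NCBI protein identifier
        problems.append(f"unexpected canonical identifier: {target!r}")
    return problems
```

Running the function over every line of a file and collecting the non-empty results gives a quick pre-submission report. It checks only the shape of each row, not whether the accessions still exist in UniProtKB or NCBI.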
- The file must meet the gp2rna format specification
- If your annotation file includes non-coding RNAs (ncRNAs), then your corresponding gp2rna file must include all ncRNA-encoding genes currently identified in the genome build, including those ncRNAs not annotated to GO.
- Functional ncRNAs must map to NCBI identifiers (NR_ or XR_) where available; leave the mapping blank where no identifier is available.
- The file must meet the gp_unlocalized format specification
- If your database supplies gene identifiers that have been manually curated from the literature, but where no sequence or genomic location is known (such genes have been variously described as 'unlocalised genes', 'single heritable traits' or 'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized file.
- This file should contain a list of all the non-genome localized gene identifiers available, including those not annotated to GO.
- If the annotation file includes macromolecular complexes as the subject of the annotation then no corresponding entry is required for the gp2protein file. Only gene or gene product mappings should be included.
- Updates of identifier mapping files: Groups must regularly update their gp2protein or gp2rna file (e.g. in response to UniProtKB's feedback on inclusion of obsolete/secondary UniProtKB accessions in a group's gp2protein, or obsoletion of NCBI identifiers). For groups who provide authoritative files for a species, or who are funded by the GO NIH grant, please see the description of GO annotation activities by central GO Consortium members.
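Responding to feedback about obsolete or secondary accessions amounts to finding the rows in a mapping file that still point at them. The sketch below illustrates one way to do that. It assumes the mapping has already been parsed into (identifier, targets) pairs, that targets are separated by `;`, and that the stale accessions arrive as a plain set. The function name and all identifiers in the example are hypothetical and not real database entries.

```python
def rows_needing_update(rows, stale_accessions):
    """Yield (source_id, accession) pairs whose mapped accession is stale.

    rows: iterable of (source_id, targets) pairs, where targets is the
          column 2 string (assumed ';'-separated, optionally 'DB:' prefixed).
    stale_accessions: set of obsolete or secondary accessions, without prefix.
    """
    for source_id, targets in rows:
        for target in targets.split(";"):
            # Drop any database prefix such as 'UniProtKB:' before comparing.
            accession = target.strip().rpartition(":")[2]
            if accession in stale_accessions:
                yield source_id, accession


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    example_rows = [
        ("MOD:gene1", "UniProtKB:P00001"),
        ("MOD:gene2", "UniProtKB:P00002;UniProtKB:P00003"),
    ]
    for source_id, accession in rows_needing_update(example_rows, {"P00003"}):
        print(f"{source_id} maps to stale accession {accession}")
```

The same function works for gp2rna rows if the stale set holds obsoleted NCBI identifiers, since it compares only the accession part of each target.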