Introduction to the GO resource

Because of the staggering complexity of biological systems and the ever-increasing size of datasets to analyze, biomedical research is becoming increasingly dependent on knowledge stored in computable form. The Gene Ontology (GO) project provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products.

The GO knowledgebase is composed of two primary components:

  • the Gene Ontology (GO), which provides the logical structure of the biological functions (‘terms’) and their relationships to one another, manifested as a directed acyclic graph
  • the corpus of GO annotations, evidence-based statements relating a specific gene product (a protein, non-coding RNA, or macromolecular complex, which we refer to hereafter as ‘genes’ for simplicity) to a specific ontology term

Together, the ontology and annotations aim to describe a comprehensive model of biological systems. Currently, the GO knowledgebase includes experimental findings from almost 140 000 published papers, represented as over 600 000 experimentally-supported GO annotations. These provide the core dataset for additional inference of over 6 million functional annotations for a diverse set of organisms spanning the tree of life.

In addition to this core knowledgebase, GO Consortium (GOC) resources also include software to edit and perform logical reasoning over the ontologies, web access to the ontology and annotations, and analytical tools that use the GO knowledgebase to support biomedical research.

Uses of the Gene Ontology and annotations

The most common use of the Gene Ontology annotations is for interpretation of large-scale molecular biology experiments, sometimes called "omics" experiments. These experiments measure either: 1) gene products (RNA and proteins), 2) variation in the DNA sequence of genes, or 3) small molecules metabolized by proteins. Thus they can all be related to gene function.

A typical omics experiment measures levels of thousands of molecules, making it difficult to interpret the underlying molecular changes (for example between a cancer cell and a normal cell). "Gene Ontology enrichment analysis" identifies relevant groups of genes that function together, which reduces the thousands of molecular changes to a much smaller number of biological functions, so that it is possible to understand what the molecular changes mean.
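The idea behind enrichment analysis can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, which is one common choice; the text above does not say which statistic any particular tool uses, and the function name and all numbers below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes measured in the experiment (background)
    annotated:  number of background genes annotated to the GO term
    sample:     number of genes in the list of interest (e.g. upregulated)
    hits:       number of genes in the list that are annotated to the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution: P(X >= hits).
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 20 of 200 changed genes carry a term that
# annotates 300 of the 10,000 genes measured.
p_value = hypergeom_enrichment_p(population=10_000, annotated=300, sample=200, hits=20)
print(f"p = {p_value:.3g}")
```

In practice this test is repeated for every GO term, so the resulting p-values need a multiple-testing correction before they are interpreted.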

The Gene Ontology is also at the hub of a major effort to represent the vast amount of biomedical knowledge in a computable form. It is linked to many other biomedical ontologies, and is a foundation for research applying computer science in biology and medicine.

You can explore the scientific publications that have used the Gene Ontology resource.

Further reading about the Gene Ontology resource

For further guidance and reading, please see the following publications:

Ontology Documentation

The Gene Ontology defines the universe of concepts relating to gene functions (‘GO terms’), and how these functions are related to each other (‘relations’). It is constantly revised and expanded as biological knowledge accumulates. The GO describes function with respect to three aspects: molecular function (molecular-level activities performed by gene products), cellular component (the locations relative to cellular structures in which a gene product performs a function), and biological process (the larger processes, or ‘biological programs’ accomplished by multiple molecular activities).

Ongoing revisions to the ontology are managed by a team of senior ontology editors with extensive experience in both biology and computational knowledge representation. Ontology updates are made collaboratively between the GOC ontology team and scientists who request the updates. Most requests come from scientists making GO annotations (these typically impact only a few terms each), and from domain experts in particular areas of biology (these typically revise an entire ‘branch’ of the ontology comprising many terms and relations). We invite researchers and computational scientists to submit requests for either new terms or new relations in the ontology.

The GO ontology is structured as a directed acyclic graph where each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. The GO vocabulary is designed to be species-agnostic, and includes terms applicable to prokaryotes and eukaryotes, and single and multicellular organisms.
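As a rough illustration of the graph structure, the sketch below stores each term's direct parents and collects all of its ancestors. The term names are hypothetical placeholders rather than real GO terms, and the `ancestors` helper is an assumption made for this example, not part of any GO tool.

```python
# Toy graph: each term maps to the set of its direct parents.
# All term names here are hypothetical placeholders.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},  # two parents: allowed in a DAG, impossible in a tree
    "D": {"C"},
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


print(sorted(ancestors("D", PARENTS)))
```

Because a term can have several parents, the traversal tracks visited terms so that shared ancestors are counted only once.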

In an example of GO annotation, the gene product "cytochrome c" can be described by the Molecular Function term "oxidoreductase activity", the Biological Process term "oxidative phosphorylation", and the Cellular Component terms "mitochondrial matrix" and "mitochondrial inner membrane".

Ontologies

Molecular Function

Molecular function terms describe activities that occur at the molecular level, such as "catalytic activity" or "binding activity". GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are "catalytic activity" and "transporter activity"; examples of narrower functional terms are "adenylate cyclase activity" or "Toll receptor binding".

It is easy to confuse a gene product name with its molecular function; for that reason GO molecular functions are often appended with the word "activity".

Cellular Component

These terms describe a location, relative to cellular compartments and structures, occupied by a macromolecular machine when it carries out a molecular function. There are two ways in which biologists describe locations of gene products: (1) relative to cellular structures (e.g., cytoplasmic side of plasma membrane) or compartments (e.g., mitochondrion), and (2) the stable macromolecular complexes of which they are parts (e.g., the ribosome). Unlike the other aspects of GO, cellular component concepts refer not to processes but rather to cellular anatomy.

Biological Process

A biological process term describes a series of events accomplished by one or more organized assemblies of molecular functions. Examples of broad biological process terms are "cellular physiological process" or "signal transduction". Examples of more specific terms are "pyrimidine metabolic process" or "alpha-glucoside transport". The general rule to assist in distinguishing between a biological process and a molecular function is that a process must have more than one distinct step.

A biological process is not equivalent to a pathway. At present, the GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

Details about the ontologies

Annotation

Annotation is the process of assigning GO terms to gene products. The annotation data in the GO database is contributed by members of the GO Consortium, and the Consortium continuously encourages new groups to start contributing their annotations. The list of links below offers details on the GO annotation policies and the annotation process, and directs users to other pages of interest on GO annotation conventions, the standard operating procedures used by some consortium members, and the GO annotation file format guide.

File Format Guide

GO Formats

The GO File Format Guide documents the structure and syntax of the files available on the GO website, to assist users who need to read, write parsers for, or create these files. The following file formats are documented separately:

The combined GO annotation and ontology data is stored as a MySQL database; see the GO database documentation for more information, including the database schema.

OBO is fully supported via the OWL-API.

  • OBO format tools in GitHub: a wrapper for the Java (OWL-API) implementation of a parser for OBOF1.4 syntax and an implementation of the OBOF1.4 mapping to OWL (uses the OWL API)
  • OWL API in GitHub: a Java API for creating, manipulating and serialising OWL Ontologies.

 

Ontology Flat File Formats

The GO Consortium uses the OBO flat file format to store the ontology data. The current version is OBO 1.4, although the ontology data is also available in the previous version, OBO 1.2. The GO Consortium no longer uses or supports files in the legacy GO format. Should you require a file in this format, the command-line script obo2flat can be used to interconvert between OBO format and the legacy GO format. obo2flat is a Java script and comes as part of the OBO-Edit package; instructions on usage are provided in the OBO-Edit User Guide.
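For readers who need to write a parser, the sketch below reads the `[Term]` stanzas of an OBO flat file into dictionaries. It is a simplified illustration, not a replacement for the OWL-API based tools: it assumes one `tag: value` pair per line, treats ` ! ` as the start of a trailing comment, and ignores escaping and trailing modifiers. The sample identifiers (`EX:...`) and names are hypothetical.

```python
def parse_obo_terms(text):
    """Return {term_id: {tag: [values]}} for every [Term] stanza in `text`."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and whole-line comments
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            in_term = line == "[Term]"
            current = {} if in_term else None
            continue
        if not in_term:
            continue  # header lines and non-Term stanzas are ignored
        tag, _, value = line.partition(":")
        tag = tag.strip()
        value = value.split(" ! ")[0].strip()  # drop a trailing comment
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


# Hypothetical sample, written for this example only.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: example parent

[Term]
id: EX:0000002
name: example child
is_a: EX:0000001 ! example parent

[Typedef]
id: part_of
name: part of
"""

for term_id, tags in parse_obo_terms(SAMPLE).items():
    print(term_id, tags["name"][0], tags.get("is_a", []))
```

Values are kept as lists because tags such as `is_a` may appear more than once in a stanza.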

OBO-XML Format

OBO-XML is a direct XML serialization of the OBO 1.2 format specification. The schema is specified using RELAX-NG compact syntax: obo-xml.rnc. Currently, only the ontology is available as OBO-XML.

OWL RDF/XML Format

OWL is a standard for ontology languages, produced by the W3C. Details of the translation used for GO are available on the official OboInOwl page.

FASTA Format

Sequence data for gene products in the GO database is available in standard FASTA format from the GO database archives.

Contributing to GO

Research groups may contribute to the Gene Ontology Consortium (GOC) by providing suggestions for updating the ontology (e.g. requesting new terms) or by providing GO annotations.

How to contribute to the ontology

How to contribute GO annotations

If your research group has GO annotations for a species that is not currently included in the GO, whether or not these annotations cover the entire genome, or if your research team has identified gaps or inaccuracies in the current set of GO annotations, this guide is for you. Choose the scenario that best describes your research group and follow the steps as indicated in the following pages.

Preparing GO Annotations for Submission

This page documents the steps required when supplying Gene Ontology annotations to the GO Consortium (GOC). For general information on how to conduct GO annotations, please see the GO Annotation Policies Guide.

Steps to prepare GO annotations for submission to the GOC

1. Contact the Gene Ontology Consortium

Please contact the GOC before carrying out the annotation work; this will ensure that GOC mentors and trainers can be of assistance in producing data sets in agreement with the GOC annotation policies and format requirements.

 

2. Provide a GAF2.0 formatted file

Research groups looking to supply Gene Ontology annotations to the Consortium must submit an appropriately formatted annotation file that conforms to syntactic and semantic requirements of the Consortium. The primary GO annotation format is the Gene Association Format (GAF) 2.0, or GAF2.0. This page contains details on how to build and populate the GAF2.0 File.

Please ensure that:

  • Submissions are made using this flat, tab-delimited format file: GAF2.0
  • The file has the correct file header
  • The file has the correct number of columns, even if some of them are not populated with data
  • If the file contains column names, these must be commented out using an exclamation mark ! at the start of the line
  • The file contains no leading or trailing spaces
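The checklist above can be turned into a simple pre-submission check. The sketch below is an illustration only: the function name is hypothetical, and the header string `!gaf-version: 2.0` and the count of 17 columns are assumptions taken from the GAF 2.0 specification rather than from this page, so confirm them against the format guide before relying on them.

```python
GAF_COLUMNS = 17  # assumption: column count from the GAF 2.0 specification
GAF_HEADER = "!gaf-version: 2.0"  # assumption: expected first line of the file


def check_gaf(lines):
    """Return a list of (line_number, message) problems found in `lines`."""
    problems = []
    if not lines or not lines[0].startswith(GAF_HEADER):
        problems.append((1, f"missing '{GAF_HEADER}' header"))
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # blank lines and commented-out lines carry no annotation
        if line != line.strip(" "):
            problems.append((number, "leading or trailing spaces"))
        # Empty columns still count, so split on every tab.
        columns = line.split("\t")
        if len(columns) != GAF_COLUMNS:
            problems.append(
                (number, f"expected {GAF_COLUMNS} columns, found {len(columns)}")
            )
    return problems


# Usage with placeholder rows: one well-formed row and one that is too short.
good_row = "\t".join(["x"] * GAF_COLUMNS)
short_row = "\t".join(["x"] * 15)
for problem in check_gaf([GAF_HEADER, good_row, short_row]):
    print(problem)
```

The check is purely syntactic; it does not verify that identifiers, GO terms, or evidence codes are valid.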
 

2.1 Make annotations to UniProtKB accessions or NCBI identifiers

  • Human data, MODs: The ideal object identifiers for annotations are stable database identifiers. Ideally, all annotations should describe the activities or locations of protein accessions from the UniProt KnowledgeBase (UniProtKB) that are present in the UniProt Reference Proteome Files.
  • Non-MODs: If this is not possible, research groups should first ensure that alternative identifiers are also stable, and then provide identifier mapping files (i.e. gp2protein, gp2rna; see below), where equivalent UniProtKB or NCBI identifiers should be supplied. A gp_unlocalized file should also be provided where no sequence or genomic location is known for a gene identifier.
  • If mapping to UniProtKB or NCBI identifiers is not a possibility: In this case the research group should contact the GOC to explore the alternatives.
 

2.2 Provide a database name

Each research group must provide a database name, which will be used to acknowledge the annotation set and to appropriately credit your work. This name would be visible in the 'assigned_by' field (Column 15) of all the annotation lines that the group is contributing. This name will also be added to the list of annotation providers.

 

2.3 Include bibliographic references

Each annotation line must include the citation of a bibliographic reference, which details the methods and results from which the annotation was made. The reference should be either a PubMed identifier or an abstract (GO_REF) describing how the annotation was made. Please see the Gene Ontology Reference Collection for a list of all current GO references.
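A reference check along these lines might look like the sketch below. The `PMID:` and `GO_REF:` prefixes followed by digits, and the use of `|` to separate multiple references in one field, are assumptions about how the identifiers are written in the annotation file; the identifiers in the usage example are placeholders, not real citations.

```python
import re

# Assumption: references are written as "PMID:<digits>" or "GO_REF:<digits>".
REFERENCE_PATTERN = re.compile(r"^(PMID|GO_REF):\d+$")


def has_valid_reference(field):
    """Return True if at least one '|'-separated entry looks like a PMID or GO_REF."""
    entries = [entry.strip() for entry in field.split("|")]
    return any(REFERENCE_PATTERN.match(entry) for entry in entries)


# Placeholder identifiers for illustration only.
for field in ["PMID:1234567", "GO_REF:0000000", "", "see lab notebook"]:
    print(repr(field), has_valid_reference(field))
```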

 

3. State whether or not regular updates will be submitted

For research groups conducting curation, it is not always necessary to commit to supplying regular updates for their annotations. When the research team chooses to enter 'Longer-term Annotation Contribution / Collaboration' as the submitting group, a primary point of contact must also be identified so that requests may be redirected and proper action on such requests may be taken in a timely manner. The GOC will take responsibility for corrections and updates to datasets included in non-recurring submissions or those from annotation groups that become 'inactive' annotation providers.

   

4. Identifier Mapping Files

Providing complete identifier mapping files is necessary for:

  • Downloading sequences from UniProtKB and NCBI. These sequences are used for inferencing annotations in a phylogenetic context using the Phylogenetic Annotation and Inference Tool (PAINT).
  • Searching for GO annotations in AmiGO, using other database cross-reference IDs (UniProt or NCBI).
  • Helping to keep track of IDs and annotations, removing duplicates, etc.

Please be aware: when identifier mapping is carried out, due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.

4.1 gp2protein file

  • The gp2protein format specifications are described here.
  • The gp2protein mapping file must contain the complete list of protein-coding genes in the respective organism (or community), including those proteins not annotated to GO.
  • The first column should contain all gene or gene product identifiers (these are typically MOD-specific identifiers). The second column should contain mappings to canonical identifiers. Protein coding genes must map to UniProtKB accessions (preferably Swiss-Prot, otherwise TrEMBL). If identifiers are unavailable in UniProtKB, NCBI identifiers (NP_ and XP_) are permissible.
  • If the annotation group is satisfied with identifier mappings from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB will take the responsibility of supplying the external ID -> UniProtKB mapping to the GO.
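The completeness requirement above can be checked with a short script. The sketch below assumes a two-column, tab-delimited file in which lines starting with `!` are comments; the function name and all identifiers are hypothetical placeholders, and the authoritative layout is the gp2protein format specification.

```python
def missing_mappings(all_gene_ids, gp2protein_lines):
    """Return the gene IDs that have no non-empty mapping in the file.

    all_gene_ids:     every protein-coding gene ID known to the database
    gp2protein_lines: lines of the mapping file, 'gene_id<TAB>canonical_id'
    """
    mapped = set()
    for raw in gp2protein_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        gene_id, _, target = line.partition("\t")
        if target.strip():
            mapped.add(gene_id)
    return sorted(set(all_gene_ids) - mapped)


# Hypothetical identifiers: g2 is listed without a mapping, g3 is absent.
genes = ["DB:g1", "DB:g2", "DB:g3"]
lines = ["!example mapping file", "DB:g1\tUniProtKB:ACC1", "DB:g2\t"]
print(missing_mappings(genes, lines))
```

Genes reported by this check either need a UniProtKB or NCBI mapping or should be raised with the GOC.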

4.2 gp2rna file

  • The gp2rna format specifications are similar to those of gp2protein files. The differences between the two are described here.
  • If the annotation file includes non-coding RNAs (ncRNAs), then the corresponding gp2rna file must include all ncRNA genes currently identified in the genome build, including ncRNAs not annotated to GO.
  • Functional ncRNAs must map to NCBI (NR_ or XR_) if available; if unavailable, leave the field blank.

4.3 gp_unlocalized file

  • If your database supplies gene identifiers that have been manually curated from the literature, where no sequence or genomic location is known (such genes have been sometimes described as 'unlocalized genes', 'single heritable traits' or 'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized_file.
  • This file should contain a list of all the non-genome localized gene identifiers available, including those not annotated to GO.
  • The file must meet the gp_unlocalized format specification, which should be similar to the gp2protein file format.

4.4 Exceptions for Macromolecular Complexes

  • If the annotation file includes macromolecular complexes as the subject of the annotation, no corresponding entry is required for the gp2protein file. Only gene or gene product mappings should be included.
  • Groups must regularly update their gp2protein or gp2rna files (e.g. in response to UniProtKB's feedback on the inclusion of obsolete or secondary UniProtKB accessions in a group's gp2protein file, or when NCBI identifiers are made obsolete). For groups who provide authoritative files for a species, or who are funded by the GO NIH grant, please consult the description of GO annotation activities by central GO Consortium members.